r/UMD is the official subreddit (sub-community of the popular social media news aggregation website Reddit) for the University of Maryland, College Park. Simply by looking at the front page of r/UMD, we can see that the community was first created on April 15, 2010, and there are 20,789 Reddit users who have joined it. For this data analysis project, we'll be digging deeper, analyzing the posts, comments, and the users of r/UMD themselves to find meaningful insights about the subreddit.
Note: throughout this Jupyter Notebook, all of our plots will be created with the Plotly Python Open Source Graphing Library. We have chosen to use this library to create our plots because it allows each plot to be interactive. Simply hover your cursor over any portion of the graphic to view the data at that point, and click and drag within the plot to zoom in on specific portions. To zoom out, double-click within the plot.
# Here are the installations/imports that we will be using throughout this project.
# Their uses will be made apparent as we utilize them.
!pip install nltk
!pip install praw
!pip install psaw
!pip install plotly
!pip install vaderSentiment
import datetime as dt
from datetime import timedelta, datetime
import time
import praw
import sqlite3
from sqlite3 import Error
import pandas as pd
from psaw import PushshiftAPI
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.stats.proportion as smp
import statsmodels.formula.api as smf
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from random import randint
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from math import log
Below we will outline our process for retrieving the necessary information from Reddit. We begin by connecting to a Python wrapper for the Pushshift API that allows us to access Reddit data. We define a couple of functions to help us simplify using sqlite3 later on, and then we outline our SQL statements for creating our various tables and execute those statements.
# Connecting to the API
r = praw.Reddit(client_id="*******",
                client_secret="*******",
                user_agent="*******")
api = PushshiftAPI(r)
# create a database connection to the SQLite database specified by db_file
def create_connection(db_file):
    conn = None
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
    return conn
# create a table from the create_table_sql statement
def create_table(conn, create_table_sql):
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except Error as e:
        print(e)
# variables for commands for creating SQL tables
sql_create_user_table = """ CREATE TABLE IF NOT EXISTS User (
        name text PRIMARY KEY,
        flair text,
        created_utc float NOT NULL
    ); """
sql_create_user_subreddits_table = """ CREATE TABLE IF NOT EXISTS UserSubreddits (
        name text,
        subreddit text,
        FOREIGN KEY (name) REFERENCES User (name)
    ); """
sql_create_post_table = """CREATE TABLE IF NOT EXISTS Post (
        id text PRIMARY KEY,
        name text NOT NULL,
        url text,
        title text,
        selftext text,
        score integer NOT NULL,
        created_utc float NOT NULL,
        permalink text,
        link_flair_text text,
        FOREIGN KEY (name) REFERENCES User (name)
    );"""
sql_create_comment_table = """CREATE TABLE IF NOT EXISTS Comment (
        id text PRIMARY KEY,
        name text NOT NULL,
        body text,
        score integer NOT NULL,
        parent_id text NOT NULL,
        link_id text NOT NULL,
        created_utc float NOT NULL,
        FOREIGN KEY (name) REFERENCES User (name),
        FOREIGN KEY (parent_id) REFERENCES Comment (id),
        FOREIGN KEY (link_id) REFERENCES Post (id)
    );"""
# create a database connection
conn = create_connection("./R_UMD.db")
# create tables
if conn is not None:
    create_table(conn, sql_create_user_table)
    create_table(conn, sql_create_user_subreddits_table)
    create_table(conn, sql_create_post_table)
    create_table(conn, sql_create_comment_table)
    conn.close()
else:
    print("Error! cannot create the database connection.")
Our scraping will take place in three phases: (1) collecting every submission (post) ever made on r/UMD, (2) collecting the comments on each of those submissions, and (3) collecting the other subreddits each r/UMD user has commented in.
For (1), we will make a request for all submissions on r/UMD after Jan 1, 2010 (before r/UMD existed) and store them in the database. We will also keep track of the users and their information along the way.
# create a database connection and scrape all submissions
conn = create_connection("./R_UMD.db")
if conn is not None:
    # r/UMD was created in April 2010, so get all posts from January 1, 2010 onward
    start_epoch = int(dt.datetime(2010, 1, 1).timestamp())
    # actual request to the API
    # we are first looking for 'submissions', i.e. posts
    results = list(api.search_submissions(after=start_epoch,
                                          subreddit='UMD',
                                          filter=['url', 'author', 'title', 'subreddit'],
                                          limit=None))
    # for each result, put the appropriate information in the appropriate table
    for res in results:
        # first add the user if they aren't already in the table
        # (we will deal with user flairs later)
        user_task = (str(res.author), res.created_utc)
        user_sql = ''' INSERT OR IGNORE INTO User(name,created_utc)
                       VALUES(?,?) '''
        # then add the information from the post to the Post table
        post_task = (res.id, str(res.author), res.url, res.title, str(res.selftext), res.score,
                     res.created_utc, str(res.permalink), str(res.link_flair_text))
        post_sql = ''' INSERT OR IGNORE INTO Post(id,name,url,title,selftext,score,created_utc,permalink,link_flair_text)
                       VALUES(?,?,?,?,?,?,?,?,?) '''
        # try executing the SQL statements above
        cur = conn.cursor()
        try:
            cur.execute(user_sql, user_task)
            cur.execute(post_sql, post_task)
        except Error as e:
            print(e)
    # commit additions to the DB
    conn.commit()
    # close connection for now
    conn.close()
else:
    print("Error! cannot create the database connection.")
Next, we need to get a table of all comments from r/UMD. To do that, we first read all of our submissions into a Pandas DataFrame to make accesses quicker, and then get the comments for each of the submissions we just scraped. This one takes a couple of hours, so go check out r/UMD and read some for yourself!
# create a database connection and make a dataframe so we can access the submissions quicker
conn = create_connection("./R_UMD.db")
df = pd.read_sql("SELECT * FROM Post", conn)
# scrape the comments for every submission
if conn is not None:
    # loop through all of the submissions we just collected in order to get their comments
    for i, row in df.iterrows():
        # actual call to the API: get the submission as an object and read its comments as a list
        sub = r.submission(id=row['id'])
        comment_list = sub.comments.list()
        # add EVERY comment to the database, along with user information (new users) and flair data if available
        for comment in comment_list:
            # add the comment to the Comment table
            comment_task = (str(comment.id), str(comment.author), comment.body, comment.score,
                            comment.parent_id, comment.link_id, comment.created_utc)
            comment_sql = ''' INSERT OR IGNORE INTO Comment(id,name,body,score,parent_id,link_id,created_utc)
                              VALUES(?,?,?,?,?,?,?) '''
            # add the user if they are not already in the User table
            user_task = (str(comment.author), comment.author.created_utc)
            user_sql = ''' INSERT OR IGNORE INTO User(name,created_utc)
                           VALUES(?,?) '''
            # if we can get a flair from a user's comment, update the User table with that flair
            flair_task = (str(comment.author_flair_text), str(comment.author))
            flair_sql = ''' UPDATE User SET flair=(?) WHERE name=(?)'''
            # try executing the SQL statements above
            cur = conn.cursor()
            try:
                cur.execute(comment_sql, comment_task)
                cur.execute(user_sql, user_task)
                cur.execute(flair_sql, flair_task)
            except Exception as e:
                print(e)
    # commit additions to the DB
    conn.commit()
    # close connection for now
    conn.close()
else:
    print("Error! cannot create the database connection.")
The last step in the scraping process is to grab a list of all the subreddits that each r/UMD user has ever commented in, since we want to analyze later on what other subreddits r/UMD users are interested in. We will again do this by reading our current SQL User table into a Pandas DataFrame, then grabbing the subreddit from each of a user's comments, for all users.
# create a database connection and make a dataframe for quicker operations, again
conn = create_connection("./R_UMD.db")
df = pd.read_sql("SELECT * FROM User", conn)
# scrape each user's subreddit history
if conn is not None:
    # get each user and look at their individual subreddit history
    for i, row in df.iterrows():
        # actual call to the API: get a Redditor object so we can see all their comments
        red = r.redditor(row['name'])
        # loop through all of the user's comments and add the subreddit they commented in to the database
        for x in red.comments.new(limit=None):
            # simply add the username and subreddit to the table
            subreddit_task = (row['name'], str(x.subreddit))
            subreddit_sql = ''' INSERT OR IGNORE INTO UserSubreddits(name,subreddit)
                                VALUES(?,?) '''
            # try executing the SQL statement above
            cur = conn.cursor()
            try:
                cur.execute(subreddit_sql, subreddit_task)
            except Error as e:
                print(e)
    conn.commit()
    conn.close()
else:
    print("Error! cannot create the database connection.")
Our data is pretty tidy after scraping from Reddit (we will do some cleaning of text later on), so now we will read it one last time into four separate Pandas DataFrames. We also want to get rid of any bots that may have posted in r/UMD. To do this, we will simply remove any data associated with users whose usernames end in "bot" (a naming convention most well-behaved bots follow).
# create the connection
conn = create_connection("./R_UMD.db")
# make dataframes from each table in the SQLite database
df_user = pd.read_sql("SELECT * FROM User", conn)
df_user_sub = pd.read_sql("SELECT * FROM UserSubreddits", conn)
df_post = pd.read_sql("SELECT * FROM Post", conn)
df_comment = pd.read_sql("SELECT * FROM Comment", conn)
# close connection -- no longer needed
conn.close()
# try and find all bots (usernames ending with 'bot')
bot_names = set(df_user_sub[df_user_sub['name'].str.endswith("bot")]['name'])
# remove any row associated with a bot from each dataframe
df_user_sub = df_user_sub[~df_user_sub.name.isin(bot_names)]
df_user = df_user[~df_user.name.isin(bot_names)]
df_post = df_post[~df_post.name.isin(bot_names)]
df_comment = df_comment[~df_comment.name.isin(bot_names)]
Reddit has become an increasingly popular way of spreading news around campus or promoting a club or event. Because of this, we wanted to investigate how many users are active Redditors elsewhere and which ones have accounts just to post on r/UMD. As it turns out, about 50% (~9,000) of the users who have ever made a post on r/UMD post there infrequently (<25% of their posts). On the other hand, about 25% (~4,000) of r/UMD users post almost exclusively on r/UMD (95% of their posts or higher).
# group the dataframe by user
gb = df_user_sub.groupby('name')
gb = [gb.get_group(x) for x in gb.groups]
l = list()
names = list()
# for each user's group of submissions, find out what percentage of them are in r/UMD
for name in gb:
    try:
        l.append([str(name['name'].reset_index(drop=True)[0]),
                  (name[name.subreddit == 'UMD']['subreddit'].value_counts() / name['subreddit'].count())[0]])
    except:
        # user never posted in r/UMD, only commented
        pass
# plot the result as a histogram
fig = px.histogram(x=[row[1]*100 for row in l], nbins=5, title="Distribution of Users' Percentage of Posts on r/UMD",
labels=dict(x="Percentage of Posts in r/UMD"))
fig.update_xaxes(range=[0, 100])
fig.update_layout(yaxis_title="Number of Users")
fig.show()
It is clear that despite a large portion of users only posting on r/UMD, there is still a large percentage of users that are active in other subreddits. Below are the top 30 alternative subreddits that r/UMD users post in. They are split up into two groups:
Default subreddits are subreddits that a user is automatically subscribed to when they first make an account on Reddit. r/UMD users have started to branch away from these subreddits as 20 of the top 30 subreddits they post in are non-default.
# dataframe that has posts not in r/UMD
df_non_umd = df_user_sub[df_user_sub.subreddit != 'UMD']
top_subs = (df_non_umd['subreddit'].value_counts()/df_non_umd['subreddit'].count())[:30]*100
subs = top_subs.index.tolist()
vals = top_subs.tolist()
# new figure for graph
fig = go.Figure()
# list of default subreddits that new users are automatically subscribed to
defaults = ['AskReddit','funny','pics','todayilearned','gaming','videos','IAmA','worldnews','news','aww','gifs','movies',
'mildlyinteresting','Showerthoughts','Music','science','explainlikeimfive','LifeProTips','personalfinance']
c_def = 0
c_oth = 0
for i, sub in enumerate(subs):
    # add a new bar to the graph
    fig.add_trace(go.Bar(
        x=[sub],
        y=[vals[i]],
        name='Default Subreddits' if sub in defaults else 'Other',
        marker_color='lightsalmon' if sub in defaults else 'blue',
        showlegend=True if (c_def == 0 and sub in defaults) or (c_oth == 0 and sub not in defaults) else False,
        legendgroup='lightsalmon' if sub in defaults else 'blue'
    ))
    c_def += 1 if sub in defaults else 0
    c_oth += 1 if sub not in defaults else 0
# plot the graph
fig.update_layout(yaxis_title="Percentage of All Users' Posts", xaxis_title="Subreddit", title="Most Popular Alternative Subreddits")
fig.show()
According to the Office of Institutional Research, Planning and Assessment (IRPA), the top 5 undergraduate degrees are all STEM majors. We wanted to see if this popularity trend persists in the r/UMD user base. About 10% of the users on r/UMD have 'flairs', i.e. banners next to their name that typically describe their major. See this for more detail. Using the users who have flairs as a representative sample, we can estimate what majors the rest of the r/UMD users have.
# make a new dataframe that only has users with a non-empty flair
df_user_train = df_user.replace(to_replace='None', value=np.nan).replace(to_replace='', value=np.nan).dropna()
df_user_train['flair_clean'] = "unknown"
# function to determine if a user is STEM or NON-STEM
# returns the type of major as a string, 'unknown' if it cannot be determined
def flair_clean(flair):
    flair = ''.join(i for i in flair if not i.isdigit())
    flair = flair.lower()
    # set of prefixes/infixes that denote a STEM major
    stem = {"cs","computer science","comp sci","cmsc","kruskal","cmns","compsci","info sci","ischool","infosci",
            "bchm","bio","chem","compe","ce","computer","compeng","comp","ee","aero","enae","enme","mech","meng",
            "math","markov","phys","it support","phnb","aosc","gis","bsci","info","chbe","fire","inst","ae",
            "network","premed","fpe","stem","ensp","enst","astro"," is","civ","comsci","ents","mse","stack",
            "eng","stat","amsc","numer","web","matsci","cmps","psci","cbmg","cpe","astr","me ","mate","enfp",
            "anatomy","bis","soft"}
    # set of prefixes/infixes that denote a NON-STEM major
    non_stem = {"econ","comm","journal","gvpt","government","policy","gov","criminal","ccjs","crim","bmgt",
                "manage","market","business","kinesi","knes","psyc","ecology","sociology","socy","design","anthro",
                "film","english","engl","arch","larc","philosophy","arhu","women","arec","anth","creative","history",
                "ansc","hist","plcy","amer","account","jour","geog","supply","art","geol","theatre","scm","agnr",
                "music","social","lang","hort","public","ling","elem","arabic","hcim","nfsc","jap","fmsc","mph",
                "ath","jd","fin","russian","germ","fam","agro","enology"}
    for major in stem:
        if major in flair:
            return "STEM"
    for major in non_stem:
        if major in flair:
            return "NON STEM"
    return "unknown"

for i, row in df_user_train.iterrows():
    df_user_train.at[i, 'flair_clean'] = flair_clean(str(row['flair']))
After cleaning the user flair data and determining the percentage of STEM and NON-STEM majors in the sample, we can find a confidence interval for a single proportion. Looking at the sample, we are 95% confident that the true proportion of STEM majors on r/UMD is between 77.8% and 81.9%. Complete data for the University is hard to find, although the IRPA report indicates that the true percentage of STEM majors for the university may lie closer to 50%-60%. We must also be wary of our results, as our sample was not truly random: users choose whether to display a flair, so the missing flairs are likely not missing at random.
# get the number of STEM and NON STEM majors
num_stem = len(df_user_train[df_user_train.flair_clean == "STEM"])
num_non_stem = len(df_user_train[df_user_train.flair_clean == "NON STEM"])
# calculate the 95% confidence interval
lower, upper = smp.proportion_confint(num_stem, num_stem + num_non_stem, alpha=0.05, method='normal')
error = (upper - lower)
avg = (upper + lower) / 2
# labels for the plot
labels = ['STEM Majors','Margin of Error','NON-STEM Majors']
values = [lower, error, 1-upper]
# plot the pie graph
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Estimated Percentage of STEM Majors and NON-STEM Majors (95% Confidence)')
fig.show()
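For intuition, `proportion_confint` with `method='normal'` is the textbook Wald interval, p_hat +/- z*sqrt(p_hat*(1 - p_hat)/n). Below is a minimal sketch of that formula with purely hypothetical counts; in the notebook, `num_stem` and `num_non_stem` come from the flair-cleaned user table above.

```python
from math import sqrt

# hypothetical counts standing in for the notebook's num_stem / num_non_stem
num_stem, num_non_stem = 1200, 300
n = num_stem + num_non_stem
p_hat = num_stem / n

z = 1.96  # ~97.5th percentile of the standard normal, for a 95% interval
margin = z * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```

With z = 1.96 this agrees with `proportion_confint(..., method='normal')` to roughly four decimal places.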
Let's start by plotting histograms of post activity and comment activity over time.
fig = px.histogram(
x=[datetime.utcfromtimestamp(s) for s in df_post["created_utc"]],
title='Posts over Time'
)
fig.show()
fig2 = px.histogram(
x=[datetime.utcfromtimestamp(s) for s in df_comment["created_utc"]],
title='Comments over Time'
)
fig2.show()
Pretty cool! You can definitely see the general upward trend of activity, as well as seasonal spikes. Hovering over the months, you can see that the dips typically fall in June, July, and August, as well as January and February. This makes sense, as these are breaks in which students are not engaged with UMD on a day-to-day basis.
So what if we want to put numbers to these trends? For example, how many fewer posts per month are there during breaks, and approximately how many more comments per month are there per year? In order to do this, we should fit a regression, predicting activity (measured by posts/comments per month) based on year and season.
The first thing we need to do is to restructure the data so that we have the posts/comments per month available.
# Work on copies so we don't alter the original dataframes
df_commentActivity = df_comment.copy()
df_postActivity = df_post.copy()
# Currently, we only have timestamps. Create two new columns by converting those timestamps into dates,
# and retrieving the relevant information from those dates.
df_commentActivity["month"] = [datetime.utcfromtimestamp(s).strftime("%b") for s in df_comment["created_utc"]]
df_commentActivity["year"] = [datetime.utcfromtimestamp(s).year for s in df_comment["created_utc"]]
df_postActivity["month"] = [datetime.utcfromtimestamp(s).strftime("%b") for s in df_post["created_utc"]]
df_postActivity["year"] = [datetime.utcfromtimestamp(s).year for s in df_post["created_utc"]]
# By grouping by month and year, we can get a count for every month (e.g. Nov 2016, Dec 2016, Jan 2017, etc.)
df_commentActivity = \
df_commentActivity.groupby(["year", "month"], as_index=False).size().reset_index().rename(columns={0: "count"})
df_postActivity = \
df_postActivity.groupby(["year", "month"], as_index=False).size().reset_index().rename(columns={0: "count"})
# One might think that a good categorization of the months would be by season - Winter, Spring, etc.
# However, looking at the seasonal trends on the histogram, there doesn't seem to be a big distinction between
# fall and spring semesters, nor winter and summer breaks. Thus, we can split Break/Semester instead of season.
seasonLookup = {
"Jan": "Break",
"Feb": "Semester",
"Mar": "Semester",
"Apr": "Semester",
"May": "Semester",
"Jun": "Break",
"Jul": "Break",
"Aug": "Break",
"Sep": "Semester",
"Oct": "Semester",
"Nov": "Semester",
"Dec": "Break"
}
# Create two new columns based on our lookup.
df_commentActivity["season"] = [seasonLookup[m] for m in df_commentActivity["month"]]
df_postActivity["season"] = [seasonLookup[m] for m in df_postActivity["month"]]
Now that the data is structured, we can try to fit a regression. Regression is the use of one or more x variables to predict a y variable, by using a line of best fit. A gentle introduction to regression can be found here using only one x predictor.
Initially, we have one response (count), and two predictors (year, and season). Let's try to fit it now.
modelComment = smf.ols(formula='count ~ year+season', data=df_commentActivity).fit()
# Regression fit
print(modelComment.summary())
# Graphs to check assumptions of regression
px.violin(x=modelComment.fittedvalues, y=modelComment.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment.resid))], y=modelComment.resid, title='Residuals vs. Index')
Looking at just the p-values, it looks good: all of them are well below the 5% threshold. However, regression is only valid under five assumptions: linearity, homoscedasticity, normality of the residuals, independence of the observations, and no multicollinearity among the predictors.
The violin graph is used to assess linearity and homoscedasticity. For linearity, we check that each violin is centered around 0; for homoscedasticity, we check that each violin is roughly the same size. Both are clearly violated: many of the violins are not centered at 0, and the violin lengths vary wildly.
The histogram is used to assess normality: we check that it follows a bell curve. There is a right skew, and the curve is a little too flat to be considered normal. Thus, normality is also violated.
The last scatter plot shows index vs. residual and is used to check independence: we look for any pattern across the observations. There is a clear pattern: a downward trend until about the 80th observation.
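Independence can also be checked numerically via the Durbin-Watson statistic that statsmodels prints in the regression summary: values near 2 suggest no first-order autocorrelation in the residuals, while values near 0 indicate strong positive autocorrelation. A small sketch computing it by hand on synthetic residuals (not the model's actual residuals):

```python
import numpy as np

def durbin_watson(resid):
    # DW = sum of squared successive differences / sum of squared residuals
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
white = rng.normal(size=1000)   # independent residuals: DW lands near 2
trending = np.cumsum(white)     # strongly autocorrelated residuals: DW near 0
print(round(durbin_watson(white), 3), round(durbin_watson(trending), 3))
```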
Statsmodels has also given us a warning that multicollinearity may be an issue.
Oh no! We have scored a 0/5 in meeting the assumptions. In order for our model to give us usable numbers, we must attempt to meet all of these assumptions.
The following steps will identify each of the assumptions, explain in simple terms why it is necessary to uphold that assumption before performing analysis, and suggest a correction to meet this assumption for the given r/UMD data set. Further reading on other corrections for other data sets can be found here, provided by Duke University.
The first assumption we should tackle is linearity: there should be roughly a linear relationship between the predictors and the response. Because it was violated in our initial model, we may have a non-linear relationship between one of our predictors and count. This makes sense: as the years increase, activity appears to increase at a non-linear rate. We will try introducing a year² term to the model to capture this non-linear relationship.
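As a toy illustration of why a squared term can capture curvature (synthetic data, not the subreddit counts): fitting a straight line to quadratic data leaves large residuals, while adding the squared term drives them to essentially zero.

```python
import numpy as np

x = np.arange(11, dtype=float)      # years 0..10 after some start year
y = 3.0 * x ** 2 + 5.0              # perfectly quadratic "activity"

# least-squares fits: a straight line vs. a parabola (adds the x^2 term)
lin_resid = y - np.polyval(np.polyfit(x, y, 1), x)
quad_resid = y - np.polyval(np.polyfit(x, y, 2), x)
print(np.abs(lin_resid).max(), np.abs(quad_resid).max())
```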
# The square function
square = lambda x: x**2
# add the year squared
modelComment = smf.ols(formula='count ~ year+square(year)+season', data=df_commentActivity).fit()
print(modelComment.summary())
px.violin(x=modelComment.fittedvalues, y=modelComment.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment.resid))], y=modelComment.resid, title='Residuals vs. Index')
That seemed to really help! The violins are more centered around 0, the histogram looks more like a bell curve, and we've eliminated the trend in the independence graph. However, the violins on the right are much bigger than the violins on the left—we will attempt to tackle this next.
Homoscedasticity is a fancy word meaning "equal spread." When we check for this assumption, we are making sure that the variance in the residuals is roughly the same in all places. The residuals are errors: if the errors get increasingly larger as the predicted values get larger, then our model will have trouble accurately predicting for large y-values. Again, this is due to the non-linear trend in the data, as we start with comments in the tens and hundreds and grow to thousands. Using a log transformation on the count variable will lessen the impact of how quickly the subreddit grew, and hopefully ensure the violins all are the same size.
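A quick sketch of the intuition (synthetic counts, not the r/UMD data): under multiplicative growth, the raw month-to-month gaps explode, but after a log transform the gaps become constant.

```python
import numpy as np

months = np.arange(24)
counts = 10.0 * 1.3 ** months       # ~30% growth per month (multiplicative)

raw_gaps = np.diff(counts)          # gaps blow up as the counts grow
log_gaps = np.diff(np.log(counts))  # constant: each gap equals log(1.3)
print(raw_gaps[0], raw_gaps[-1], log_gaps[0])
```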
square = lambda x: x**2
# transform count into log(count)
modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivity).fit()
print(modelComment.summary())
px.violin(x=modelComment.fittedvalues, y=modelComment.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment.resid))], y=modelComment.resid, title='Residuals vs. Index')
The transformation again did wonders for all three graphs: the violins (bar one) are now roughly the same size, our histogram (bar the skew) is normal, and the independence graph is looking better as well. In all three cases, there is one culprit: outliers.
The assumption of normality checks that the residuals (not the data!) follow a normal distribution. If the residuals are non-normal, such as right now, there will be a skew when estimating p-values. Our curve looks pretty decent, except for a few outliers giving it a right skew. In order to fix this, we will investigate the outliers and possibly remove them.
square = lambda x: x**2
# previous model
modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivity).fit()
# test for outliers
test = modelComment.outlier_test()
outliers = [i for i,t in enumerate(test["bonf(p)"]) if t < 0.5]
# investigate outliers
print([df_commentActivity.iloc[i] for i in outliers])
# drop outliers
df_commentActivityNoOutliers = df_commentActivity.drop(outliers)
# Use model without outliers
modelComment2 = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivityNoOutliers).fit()
print(modelComment2.summary())
px.violin(x=modelComment2.fittedvalues, y=modelComment2.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment2.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment2.resid))], y=modelComment2.resid, title='Residuals vs. Index')
Investigating the outliers, it seems that they are from the inception of the subreddit. Because they differ from the other points in terms of their x-distance and their y-distance, they significantly affect the model and should be taken out. Because we have over 100 other observations, we should be fine.
After successfully removing the outliers, the graphs are centered better. Now, we will tackle independence.
At first, the violin plot looks fine: all of the violins' centers fall between -1 and 1, and the spread for the most part is roughly the same throughout. However, looking at the centers of the violins, there is a certain up-and-down curvature, especially at the beginning. Looking at the scatter plot, this same curvature exists. There is still a slight pattern in our scatter plot, meaning the residuals are not independent with respect to time. In order to capture this relationship, we can add a lagged variable, meaning we can use the count of the previous month to predict the current month.
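The lagged predictor relies on pandas `shift`. A tiny sketch with hypothetical counts: `shift(1)` moves each value down one row, so row i holds row i-1's count (the first row becomes NaN, and statsmodels drops rows with missing values when fitting).

```python
import pandas as pd

df = pd.DataFrame({"count": [10, 13, 18, 25]})
# shift(1): each row receives the PREVIOUS row's value; the first row becomes NaN
df["countLag"] = df["count"].shift(1)
print(df)
```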
square = lambda x: x**2
modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivity).fit()
test = modelComment.outlier_test()
outliers = [i for i,t in enumerate(test["bonf(p)"]) if t < 0.5]
df_commentActivityNoOutliers = df_commentActivity.drop(outliers)
df_commentActivityNoOutliersLagged = df_commentActivityNoOutliers.copy()
# Create lagged variable: shift(1) gives each row the previous month's count
df_commentActivityNoOutliersLagged["countLag"] = df_commentActivityNoOutliersLagged["count"].shift(1)
modelComment2 = smf.ols(formula='np.log(count) ~ year+square(year)+season+np.log(countLag)',
data=df_commentActivityNoOutliersLagged).fit()
print(modelComment2.summary())
px.scatter(x=modelComment2.fittedvalues, y=modelComment2.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment2.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment2.resid))], y=modelComment2.resid, title='Residuals vs. Index')
Unfortunately, the violin plots are too fine here to be displayed, so we've used a scatter plot in place of the violin. The interpretation is the same: if we imagine splitting the dots horizontally into chunks, then each chunk should have a mean of zero and a spread that is equal throughout. Save for a few minor outliers, the assumptions have almost been satisfied. Now, let's get rid of that collinearity error.
This one is rather simple: year and square(year) have strong collinearity because the latter is computed from the former. We can center these variables by subtracting their means, which removes most of the correlation between a variable and its square while producing an equivalent model.
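To see why centering helps, here is a toy sketch on a synthetic range of years (not the notebook's data): for values symmetric about their mean, the correlation between x and x² drops from nearly 1 to essentially 0 after centering.

```python
import numpy as np

year = np.arange(2010, 2021, dtype=float)
centered = year - year.mean()       # symmetric about 0: -5, -4, ..., 5

# Pearson correlation between each variable and its square
corr_raw = np.corrcoef(year, year ** 2)[0, 1]
corr_centered = np.corrcoef(centered, centered ** 2)[0, 1]
print(corr_raw, corr_centered)
```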
square = lambda x: x**2
# centering function
center = lambda x: x - x.mean()
modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+season', data=df_commentActivity).fit()
test = modelComment.outlier_test()
outliers = [i for i,t in enumerate(test["bonf(p)"]) if t < 0.5]
df_commentActivityNoOutliers = df_commentActivity.drop(outliers)
df_commentActivityNoOutliersLagged = df_commentActivityNoOutliers.copy()
df_commentActivityNoOutliersLagged["countLag"] = df_commentActivityNoOutliersLagged["count"].shift(1)
modelComment2 = smf.ols(formula='np.log(count) ~ center(year)+square(center(year))+season+np.log(countLag)',
data=df_commentActivityNoOutliersLagged).fit()
print(modelComment2.summary())
px.scatter(x=modelComment2.fittedvalues, y=modelComment2.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment2.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment2.resid))], y=modelComment2.resid, title='Residuals vs. Index')
Ta-da! Our model should be ready for interpretation. However, along the way, our season and year² variables have become statistically insignificant. Nevertheless, the time of year should still have an effect on the activity per month. The issue is probably that we defined season too broadly; perhaps it will become statistically significant if we narrow it down to summer break only.
# Redefine "season" to be only summer break
df_commentActivity["summer"] = [m in ["Jun", "Jul", "Aug"] for m in df_commentActivity["month"]]
square = lambda x: x**2
center = lambda x: x - x.mean()
modelComment = smf.ols(formula='np.log(count) ~ year+square(year)+summer', data=df_commentActivity).fit()
test = modelComment.outlier_test()
outliers = [i for i,t in enumerate(test["bonf(p)"]) if t < 0.5]
df_commentActivityNoOutliers = df_commentActivity.drop(outliers)
df_commentActivityNoOutliersLagged = df_commentActivityNoOutliers.copy()
df_commentActivityNoOutliersLagged["countLag"] = df_commentActivityNoOutliersLagged["count"].shift(1)
# Remove year^2 and fit summer instead of season
modelComment2 = smf.ols(formula='np.log(count) ~ center(year)+summer+np.log(countLag)',
data=df_commentActivityNoOutliersLagged).fit()
print(modelComment2.summary())
px.scatter(x=modelComment2.fittedvalues, y=modelComment2.resid, title='Residuals vs. Fitted').show()
px.histogram(x=modelComment2.resid, title='Distribution of Residuals').show()
px.scatter(x=[i for i in range(len(modelComment2.resid))], y=modelComment2.resid, title='Residuals vs. Index')
Voila! There is our final model, with all assumptions satisfied and all variables significant. We can now interpret the coefficients:
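Since the response is log(count), each coefficient acts multiplicatively: a coefficient b multiplies the expected monthly count by exp(b), a 100*(exp(b) - 1)% change. A sketch with purely hypothetical coefficient values (the real ones come from `modelComment2.params` above):

```python
from math import exp

# purely hypothetical fitted coefficients, for illustration only
coefs = {"center(year)": 0.25, "summer[T.True]": -0.40}

for name, b in coefs.items():
    # on the log scale, b shifts log(count) by b, so count is multiplied by exp(b)
    print(f"{name}: x{exp(b):.3f} per unit ({(exp(b) - 1) * 100:+.1f}%)")
```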
We can easily do the same for posts:
df_postActivity["summer"] = [m in ["Jun", "Jul", "Aug"] for m in df_postActivity["month"]]
square = lambda x: x**2
center = lambda x: x - x.mean()
modelPost = smf.ols(formula='np.log(count) ~ year+square(year)+summer', data=df_postActivity).fit()
test = modelPost.outlier_test()
outliers = [i for i, t in enumerate(test["bonf(p)"]) if t < 0.5]
df_postActivityNoOutliers = df_postActivity.drop(outliers)
df_postActivityNoOutliersLagged = df_postActivityNoOutliers.copy()
df_postActivityNoOutliersLagged["countLag"] = df_postActivityNoOutliersLagged["count"].shift(1)
# Remove year^2 and fit summer instead of season
modelPost2 = smf.ols(formula='np.log(count) ~ center(year)+summer+np.log(countLag)',
                     data=df_postActivityNoOutliersLagged).fit()
print(modelPost2.summary())
Exploring this data as a time series, we can investigate a few more trends. Using SQL, we can group every user's posts and comments and find the timestamps of their first and last posts and comments. Combining this information gives us each user's first and last activity (post or comment) on the subreddit.
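The UNION + GROUP BY logic in the SQL below can also be sketched in pandas on toy data (the column names here mirror our Post and Comment tables):

```python
import pandas as pd

# Toy post/comment tables with the same columns as our SQL schema
posts = pd.DataFrame({'name': ['a', 'a', 'b'], 'created_utc': [100, 500, 300]})
comments = pd.DataFrame({'name': ['a', 'b'], 'created_utc': [50, 900]})

# Stack both activity types, then take the per-user min and max timestamp,
# mirroring the UNION + outer GROUP BY in the SQL query
activity = pd.concat([posts, comments])
startend = activity.groupby('name')['created_utc'].agg(first='min', last='max')
startend['duration'] = startend['last'] - startend['first']
print(startend)
```

User 'a' spans 50 to 500 (duration 450) and user 'b' spans 300 to 900 (duration 600), which is exactly the first/last/duration structure the SQL query returns.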
conn = create_connection("./R_UMD.db")
df_startend = pd.read_sql(
"""SELECT name,
MIN(mn) as firstUTC, MAX(mx) as lastUTC,
MAX(mx) - MIN(mn) as durationUTC,
DATETIME(MIN(mn), 'unixepoch') as firstDate,
DATETIME(MAX(mx), 'unixepoch') as lastDate,
JulianDay(DATETIME(MAX(mx), 'unixepoch')) - JulianDay(DATETIME(MIN(mn), 'unixepoch')) as durationDays FROM (
SELECT name, MIN(created_utc) as mn, MAX(created_utc) as mx FROM Post GROUP BY name
UNION
SELECT name, MIN(created_utc) as mn, MAX(created_utc) as mx FROM Comment GROUP BY name
) t1 GROUP BY name ORDER BY durationDays DESC""",
conn
)
conn.close()
df_startend.head()
Using this structured data, we can start to plot the first and last activity for every user.
fig = px.histogram(
x=[datetime.utcfromtimestamp(s).month for s in df_startend["firstUTC"]],
title='First activity by month'
).show()
fig = px.histogram(
x=[datetime.utcfromtimestamp(s).month for s in df_startend["lastUTC"]],
title='Last activity by month'
).show()
This reveals a couple of interesting trends. In both graphs, the spring semester looks very similar to the fall semester, so the cycle of activity appears to be semester-by-semester rather than year-by-year. Also, the shape within a semester differs between first and last activity. First activity peaks in the middle of the semester, perhaps suggesting that users gradually discover the subreddit as the semester progresses. Last activity slowly increases as the semester goes on, which makes sense: users are most likely to stop interacting with the subreddit when they graduate or go on break.
Let's see if we can get some statistics for how long users stay on the subreddit.
px.histogram(
x=[timedelta(seconds=s).days for s in df_startend["durationUTC"]],
title="Days actively participating in subreddit"
).show()
print("Mean duration: {}".format(timedelta(seconds=df_startend["durationUTC"].mean())))
print("Median duration: {}".format(timedelta(seconds=df_startend["durationUTC"].median())))
print("Standard deviation of duration: {}\n".format(timedelta(seconds=df_startend["durationUTC"].std())))
print("Users with a single activity: {}".format(df_startend["durationUTC"][df_startend["durationUTC"] == 0].count()))
print("Percentage of users with a single activity: {0:.2f}%".format(
df_startend["durationUTC"][df_startend["durationUTC"] == 0].count() * 100 /
df_startend["durationUTC"].count()))
With over a third of the users only participating once, it's hard to see the bigger picture. Let's take them out and see the data again.
px.histogram(
x=[timedelta(seconds=s).days for s in df_startend[df_startend["durationUTC"] > 0]["durationUTC"]],
title="Days actively participating in subreddit (more than one post/comment)"
).show()
print("Mean duration: {}".format(timedelta(seconds=df_startend[df_startend["durationUTC"] > 0]["durationUTC"].mean())))
print("Median duration: {}".format(timedelta(seconds=df_startend[df_startend["durationUTC"] > 0]["durationUTC"].median())))
print("Standard deviation of duration: {}\n".format(
timedelta(seconds=df_startend[df_startend["durationUTC"] > 0]["durationUTC"].std())))
secsInDay = 60 * 60 * 24
print("Users only posting on a single day: {}".format(
df_startend["durationUTC"][df_startend["durationUTC"] < secsInDay].count()))
print("Percentage of users with less than a single day of activity: {0:.2f}%".format(
df_startend["durationUTC"][df_startend["durationUTC"] < secsInDay].count() * 100 /
df_startend["durationUTC"].count()))
The distribution is again heavily skewed. Further analysis reveals that almost half of users never post for more than one day.
Let's try one more time, filtering out all users who only have activity on one day.
px.histogram(
x=[timedelta(seconds=s).days for s in df_startend[df_startend["durationUTC"] > secsInDay]["durationUTC"]],
title="Days actively participating in subreddit (more than one day of posting/commenting)"
).show()
print("Mean duration: {}".format(
timedelta(seconds=df_startend[df_startend["durationUTC"] > secsInDay]["durationUTC"].mean())))
print("Median duration: {}".format(
timedelta(seconds=df_startend[df_startend["durationUTC"] > secsInDay]["durationUTC"].median())))
print("Standard deviation of duration: {}\n".format(
    timedelta(seconds=df_startend[df_startend["durationUTC"] > secsInDay]["durationUTC"].std())))
Out of curiosity, let's plot the activity of users at arbitrary breakpoints: 1 post, 1 day, 1 month, 6 months (roughly a semester), 1 year, 2 years, and 4 years.
pieSlices = [
len(df_startend[df_startend["durationUTC"] == 0]),
len(df_startend[(df_startend["durationUTC"] > 0) & (df_startend["durationUTC"] < secsInDay)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay) & (df_startend["durationUTC"] < secsInDay * 30)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 30) & (df_startend["durationUTC"] < secsInDay * 180)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 180) & (df_startend["durationUTC"] < secsInDay * 365)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 365) & (df_startend["durationUTC"] < secsInDay * 365 * 2)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 365 * 2) & (df_startend["durationUTC"] < secsInDay * 365 * 4)]),
len(df_startend[(df_startend["durationUTC"] > secsInDay * 365 * 4)]),
]
labels = \
["1 post", "<1 day", "1 day - 1 month", "1 month - 6 months", "6 months - year", "1-2 years", "2-4 years", ">4 years"]
total = 0
text = []
for sl in pieSlices:
    total += sl
    text.append(total)
text = ["{0:.2f}%".format(t / total * 100) for t in text]
go.Figure(data=[go.Pie(labels=labels, values=pieSlices, text=text, sort=False)]) \
.update_traces(hoverinfo='label+percent', textinfo='text', textfont_size=14) \
.show()
The chart shows, on each slice's label, the cumulative percentage of users whose activity spans less than that slice's upper bound. Hovering over any slice shows that slice's individual share as a percentage.
Given this chart, we can see that 88.2% of users are active for less than 2 years. This dispels the notion of a common archetype of 4-year users (users who join their freshman year and leave their senior year). 69.6% of users are constrained within the time frame of a single semester.
Next, we will categorize all of r/UMD's posts into different groups. None of the posts are pre-labeled, and in fact the categories themselves are not defined yet. We will use Scikit-Learn's KMeans algorithm, which requires us to properly prepare our text dataset and create a TF-IDF matrix. This is an example of unsupervised machine learning, as we do not have a labeled dataset for testing the KMeans model.
First, we need to define a function to clean up our text. This function breaks a line into tokens, reduces tokens to their stems, and removes stopwords, which are words (such as articles) that add little to a post's meaning. This effectively sanitizes our text and makes it as uniform as possible.
# create stemmer and stopword list
ps = PorterStemmer()
words = stopwords.words('english')
# lowercases the text, strips non-letters, removes stopwords, and reduces each word to its stem
# (lowercasing must happen before the stopword check, since the stopword list is lowercase)
def clean(x):
    return ' '.join([ps.stem(i) for i in re.sub('[^a-zA-Z]', ' ', x).lower().split() if i not in words])
Now that we have defined our tokenization function, we can use it to clean up our dataset. Let's go ahead and apply it to both the title and the body of the post, and save the "clean" versions as new columns. While we're at it, we can add a new column which contains the title appended to the body, since for most of our analysis, we will treat the combination of title and body as the text of each post. This weights the title and the post body equally in determining the content.
# clean the title and text by applying the clean function defined above
title_clean = df_post['title'].apply(clean)
text_clean = df_post['selftext'].apply(clean)
# concatenate the cleaned title and text (separated by a space) to create a column with all words per post
df_post['doc'] = title_clean.map(str) + ' ' + text_clean
In order to create a TF-IDF matrix, we will use Scikit-Learn's TfidfVectorizer package. Since we have already cleaned our data, all we have to do is create a new TfidfVectorizer, convert the post texts to a list and fit the vectorizer, and construct a new dataframe from the result.
titles = df_post['title'].tolist()
corpus = df_post['doc'].tolist()
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
df_post_tfidf = pd.DataFrame(X.T.todense(), index=vectorizer.get_feature_names(), columns = titles)
Now that we have a TF-IDF matrix, we can use Scikit-Learn's KMeans function to split the data up into clusters based on similarities in text from the TF-IDF matrix.
Unfortunately, KMeans is non-deterministic: because the initial centroids are chosen randomly, it can give a different result each time it is run, and the clusters can change between runs. This can be alleviated by providing an integer random seed, which makes KMeans give the same result every time. Though the selection is arbitrary, we have chosen 971 as our random seed.
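A quick sanity check of the seeding behavior on synthetic data (a small random matrix standing in for our TF-IDF matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

# With a fixed random_state, repeated fits yield identical cluster labels
rng = np.random.default_rng(0)
data = rng.random((60, 5))  # synthetic stand-in for the TF-IDF matrix
labels1 = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=971).fit_predict(data)
labels2 = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=971).fit_predict(data)
print((labels1 == labels2).all())  # → True: the seed makes the result repeatable
```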
Here we are specifying k = 15 to categorize the data into 15 clusters. We need to manually inspect each cluster to see what the posts in each cluster have in common, and we can give names to our clusters.
# using KMeans, cluster the data into a set number of categories
true_k = 15
r = 971 # KMeans is non-deterministic unless we specify the random seed
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1, random_state = r)
# fit the model
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
# free up extra space
del df_post_tfidf
After looking at the clusters, we have decided on some appropriate titles for each group:
subjects = {0 : 'major requirements', 1 : 'admissions / trasfer', 2 : 'math', 3 : 'sports', 4 : 'general umd',
5 : 'parking', 6 : 'housing', 7 : 'cs classes', 8 : 'housing', 9 : 'events / internet', 10 : 'weekly posts',
11 : 'registration', 12 : 'course / campus questions'}
Now our KMeans model is trained! We can test it out by taking a random sample of the data and predicting each post's category by using the model's predict() function. We have printed this output below so it can be seen how accurate the groupings are.
# Unwraps the prediction from the model, merging clusters with similar
# characteristics into a single category index.
def classify(post):
    Y = vectorizer.transform([post])
    prediction = model.predict(Y)[0]
    # fold overlapping clusters together
    if prediction == 12:
        prediction = 11
    if prediction > 12:
        prediction = 12
    return prediction
#create random sample of dataframe
sample = df_post.sample(n=40)
header_str = '~~~~~~~~~~'
pred = []
#add column for the prediction to the dataframe
for row in sample.iterrows():
    pred.append(classify(row[1]['doc']))
sample['pred'] = pred
#display sample posts by subject
for i in range(0, 13):
    print()
    print(header_str, subjects[i], header_str)
    sub = sample[sample['pred'] == i]
    for row in sub.iterrows():
        print(row[1]['title'])
Since we can see that the categorization is accurate, we can go ahead and append a new column to our posts dataframe with their classification, and we are done with categorizing the posts of r/UMD!
classifications = []
# add a classification for every row
for row in df_post.iterrows():
    classifications.append(subjects[classify(row[1]['doc'])])
df_post['class'] = classifications
df_post = df_post.drop(['doc'], axis=1)  # don't need this column anymore
df_post.head(5)
We can also use Plotly to create a pie chart, which can be a nice visual aid to show the breakdown of the post categories across the entire subreddit.
To do this, we must iterate through the table and tally up each category.
# tally dictionary whose keys are taken from the subjects dictionary
subject_count = {s: 0 for s in subjects.values()}
# tally up each category
for row in df_post.iterrows():
    subject_count[row[1]['class']] += 1
# Create temporary dataframe for use of Plotly
df_temp = pd.DataFrame()
df_temp['classification'] = subject_count.keys()
df_temp['count'] = subject_count.values()
fig = px.pie(df_temp, values = 'count', names='classification', title='Classification of r/UMD Posts by Percent')
fig.show()
Looking at the breakdown of posts, a few topics are clearly much more common than others. Unsurprisingly, general UMD posts take the lead, as this is a UMD-themed subreddit. A large share of posts concern course advice, as well as registration, housing, and campus events.
Next, we will look at all the posts of r/UMD and analyze their sentiments, classifying them as positive, negative, or neutral.
To do this, we will be using the VADER (Valence Aware Dictionary and sEntiment Reasoner) algorithm, which is a pre-trained model that specializes in sentiment analysis of social media posts.
VADER takes in a string and returns four scores: positive, neutral, negative, and compound. The first three reflect the proportions of the string made up of positive, neutral, and negative keywords, and always sum to 1. The compound score is a composite of the first three, between -1 and 1, normalized to account for the context, length, and emphasis of the words. In accordance with the VADER guidelines, we define a compound score >= 0.05 as positive, between -0.05 and 0.05 as neutral, and <= -0.05 as negative.
Firstly, we define a simple function to return 'positive', 'negative', or 'neutral' based on the composite score.
analyzer = SentimentIntensityAnalyzer()
def classify_sentiment(sentence):
    # four-part score from VADER; we bucket on the compound score
    score = analyzer.polarity_scores(sentence)
    if score['compound'] >= 0.05:
        return 'positive'
    if score['compound'] <= -0.05:
        return 'negative'
    return 'neutral'
We need to iterate through all posts and run this function to get a sentiment for each post. We can also add a new column to the table corresponding to the sentiment of each post.
sentiments = []
for row in df_post.iterrows():
    p = classify_sentiment(row[1]['title'] + ' ' + row[1]['selftext'])
    sentiments.append(p)
df_post['sentiment'] = sentiments
df_post.head(5)
Now, just as before, we can create a pie chart to illustrate the sentiment distribution of r/UMD using Plotly.
sent_count = {'positive': 0, 'negative': 0, 'neutral': 0}
# iterate through the table and tally sentiments
for row in df_post.iterrows():
    sent_count[row[1]['sentiment']] += 1
# plot the pie chart
df_temp = pd.DataFrame()
df_temp['sentiment'] = sent_count.keys()
df_temp['count'] = sent_count.values()
fig = px.pie(df_temp, values = 'count', names='sentiment', title='Sentiment of r/UMD Posts by Percent')
fig.show()
This pie chart shows us that most of the posts in r/UMD tend to have a positive sentiment.
But wait, there's more! Since it was so straightforward to perform a sentiment analysis on the posts, let's repeat the same process for the comments.
First, let's append a sentiment column to the comments dataframe.
sentiments = []
for row in df_comment.iterrows():
    p = classify_sentiment(row[1]['body'])
    sentiments.append(p)
df_comment['sentiment'] = sentiments
df_comment.head(5)
And now, as before, we will create our pie chart showing the sentiment of the comments.
sent_count = {'positive': 0, 'negative': 0, 'neutral': 0}
for row in df_comment.iterrows():
    sent_count[row[1]['sentiment']] += 1
df_temp = pd.DataFrame()
df_temp['sentiment'] = sent_count.keys()
df_temp['count'] = sent_count.values()
fig = px.pie(df_temp, values = 'count', names='sentiment', title = 'Sentiment of r/UMD Comments by Percent')
fig.show()
As can be seen in the above pie chart, the majority of the comments are positive, which may indicate that the r/UMD community has mostly supportive commentary.
In the following cell, we will tokenize every single word ever written in a post title, post description, or comment on r/UMD, and put them all in a single dataframe. This dataframe will have a column for the word, a column for the source of the word (post title, post description, or comment), a column for the username of the person who wrote the word, and a column for the date and time at which the word was posted.
To make the process more efficient, we collect the words in lists of dictionaries that we then convert to a dataframe all at once (this is much faster than appending each word to the dataframe individually). We first do this for the posts (both titles and descriptions), and then we repeat the process for the comments. To run this code successfully, we use the "del" statement to drop references to large intermediate objects so their memory can be reclaimed, and we doubled the total memory allocated to the Docker container in the Docker settings. Because this cell is so expensive, at the end of the block we save the result as a .CSV so that if the kernel restarts, we don't have to rerun the entire block.
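The list-of-dicts pattern on a two-row toy example: the frame is allocated once from the finished list, rather than copied on every row append.

```python
import pandas as pd

# Building the frame once from a list of dicts is much cheaper than
# appending one row at a time, since each append copies the whole frame
rows = [
    {'word': 'terps', 'source': 'title'},
    {'word': 'stamp', 'source': 'comment'},
]
frame = pd.DataFrame(rows)  # single allocation for all rows
print(len(frame), list(frame.columns))  # → 2 ['word', 'source']
```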
# Create a tokenizer based on a regular expression that filters out punctuation,
# but keeps apostrophes in contractions, hyphens in hyphenated words, and periods in decimals
tokenizer = RegexpTokenizer(r'\d\.\d+|\w+[\'-]?\w*|\$?\d+\.\d+')
# Make list for the title and list for the description
# Each list will be a list of dictionaries that will then be converted to a dataframe
title_words_list = []
desc_words_list = []
for index, row in df_post.iterrows():
    # tokenize the title
    title_tokens = tokenizer.tokenize(row['title'])
    # get parts of speech
    title_tokens = nltk.pos_tag(title_tokens)
    # tokenize the description
    desc_tokens = tokenizer.tokenize(row['selftext'])
    # get parts of speech
    desc_tokens = nltk.pos_tag(desc_tokens)
    for title_tok in title_tokens:
        # key = column name
        title_dict = {}
        # Convert each word to lower-case so that varied capitalization doesn't interfere with our word counts later
        title_dict['word'] = title_tok[0].lower()
        title_dict['pos'] = title_tok[1]
        title_dict['source'] = 'title'
        title_dict['user'] = row['name']
        title_dict['sentiment'] = row['sentiment']
        title_dict['date'] = row['created_utc']
        # add the created date again, but this time just the date rather than the date and time (we'll use this later)
        created = time.localtime(row['created_utc'])
        title_dict['date_ymd'] = dt.datetime(created.tm_year, created.tm_mon, created.tm_mday).timestamp()
        title_words_list.append(title_dict)
    for desc_tok in desc_tokens:
        desc_dict = {}
        desc_dict['word'] = desc_tok[0].lower()
        desc_dict['pos'] = desc_tok[1]
        desc_dict['source'] = 'description'
        desc_dict['user'] = row['name']
        desc_dict['sentiment'] = row['sentiment']
        desc_dict['date'] = row['created_utc']
        created = time.localtime(row['created_utc'])
        desc_dict['date_ymd'] = dt.datetime(created.tm_year, created.tm_mon, created.tm_mday).timestamp()
        desc_words_list.append(desc_dict)
# Add the words from the titles and the descriptions to a dataframe
words_frame = pd.DataFrame(title_words_list)
words_frame = words_frame.append(pd.DataFrame(desc_words_list))
# Clear up memory
del desc_words_list
del title_words_list
print("All words from posts and descriptions successfully added to dataframe.")
# Function to get the words from all the comments.
# This function will be called several separate times to deal with memory issues,
# allowing us to clear up memory between each call.
def get_words_from_comments(comm_words_list, start, end):
    count = start
    if end > df_comment.shape[0]:
        end = df_comment.shape[0]
    for index, row in df_comment[start:end].iterrows():
        # tokenize the comment and tag parts of speech
        comm_tokens = tokenizer.tokenize(row['body'])
        comm_tokens = nltk.pos_tag(comm_tokens)
        for comm_tok in comm_tokens:
            comm_dict = {}
            # Convert each word to lower-case so that varied capitalization doesn't interfere with our word counts later
            comm_dict['word'] = comm_tok[0].lower()
            comm_dict['pos'] = comm_tok[1]
            comm_dict['source'] = 'comment'
            comm_dict['user'] = row['name']
            comm_dict['sentiment'] = row['sentiment']
            comm_dict['date'] = row['created_utc']
            # add the created date again, but this time just the date rather than the date and time (we'll use this later)
            created = time.localtime(row['created_utc'])
            comm_dict['date_ymd'] = dt.datetime(created.tm_year, created.tm_mon, created.tm_mday).timestamp()
            comm_words_list.append(comm_dict)
        # keep track of the count to show progress
        count += 1
        if count % 10000 == 0:
            print("Processed comments:", count)
    return comm_words_list
i = 0
while i < df_comment.shape[0]:
    # We will process 30,000 comments at a time
    i = i + 30000
    comm_wordlist = get_words_from_comments([], i - 30000, i)
    # Add the words from the comments to the dataframe
    words_frame = words_frame.append(pd.DataFrame(comm_wordlist), sort=True)
    # Clear up memory
    del comm_wordlist
print("Total number of words in r/UMD:", len(words_frame))
words_frame.head()
# Save the dataframe as a .CSV so that we don't have to rerun all this code if the kernel restarts
words_frame.to_csv(path_or_buf='all_words_in_r_umd.csv', index=False)
We will avoid running the above cell unless the database containing all the r/UMD data has been updated. Otherwise, we will simply read from the CSV file that the above cell generates, as is done in the following cell.
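This read-if-cached, rebuild-otherwise pattern can be wrapped in a small helper; `build_words_frame` below is a hypothetical placeholder standing in for the expensive tokenization cell above:

```python
import os
import pandas as pd

# Hypothetical stand-in for the expensive tokenization cell above
def build_words_frame():
    return pd.DataFrame({'word': ['example'], 'source': ['title']})

def load_words(path='all_words_in_r_umd.csv'):
    # Read the cached CSV if it exists; otherwise rebuild and cache it
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = build_words_frame()
    frame.to_csv(path, index=False)
    return frame
```

With this shape, rerunning the notebook after a kernel restart costs one CSV read instead of hours of tokenization.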
# The "del words_frame" is included to let us read from the .CSV completely fresh.
# It's in a try-except block so the cell runs whether or not words_frame is defined.
try:
    del words_frame
except NameError:
    pass
words_frame = pd.read_csv('./all_words_in_r_umd.csv')
print("Total number of words in r/UMD:", len(words_frame))
words_frame.head()
In the following cells, we will count the number of words each user has written, the number of posts and comments each user has made, and the total karma each user has accumulated within r/UMD. We will put these into dataframes, write the first three to .CSV files, display the top 11 users for each category in tables, and visualize the top 50 users in bar charts using Plotly.
# Get the most verbose users:
most_verbose = pd.DataFrame(words_frame['user'].value_counts()).reset_index()
most_verbose = most_verbose.rename(columns={'index':'user', 'user':'num_words'})
most_verbose.to_csv('most_verbose_users.csv', index=False)
print('Most Verbose Users:')
display(most_verbose.head(11))
# Truncate the dataframe to exclude "None" for the plot
most_verbose_trunc = most_verbose.truncate(before=1).head(50)
most_verbose_fig = go.Figure(
data=[go.Bar(x=most_verbose_trunc['user'], y=most_verbose_trunc['num_words'])],
layout_title_text="Most Verbose Users"
)
most_verbose_fig.update_yaxes(title_text = 'Number of Words')
most_verbose_fig.update_xaxes(title_text='User')
most_verbose_fig.show()
# Get the most-posting users
most_posts = pd.DataFrame(df_post['name'].value_counts()).reset_index()
most_posts = most_posts.rename(columns={'index':'user', 'name':'num_posts'})
most_posts.to_csv('most_posting_users.csv', index=False)
print('Most Posting Users:')
display(most_posts.head(11))
# Truncate the dataframe to exclude "None" from the plot
most_posts_trunc = most_posts.truncate(before=1).head(50)
most_posts_fig = go.Figure(
data=[go.Bar(x=most_posts_trunc['user'], y=most_posts_trunc['num_posts'])],
layout_title_text="Most Posting Users"
)
most_posts_fig.update_xaxes(title_text='User')
most_posts_fig.update_yaxes(title_text='Number of Posts')
most_posts_fig.show()
# Get the most-commenting users
most_comments = pd.DataFrame(df_comment['name'].value_counts()).reset_index()
most_comments = most_comments.rename(columns={'index':'user', 'name':'num_comments'})
most_comments.to_csv('most_commenting_users.csv', index=False)
print('Most Commenting Users:')
display(most_comments.head(11))
# Truncate the dataframe to exclude "None" from the plot
most_comments_trunc = most_comments.truncate(before=1).head(50)
most_comments_fig = go.Figure(
data=[go.Bar(x=most_comments_trunc['user'], y=most_comments_trunc['num_comments'])],
layout_title_text="Most Commenting Users"
)
most_comments_fig.update_xaxes(title_text='User')
most_comments_fig.update_yaxes(title_text='Number of Comments')
most_comments_fig.show()
# Get the users with the most total karma from all their posts and comments on r/UMD
# Concatenate the df_post and df_comment dataframes, then group by the name, and then sum the karma
grouped_karma = pd.concat([df_post, df_comment], sort=False).groupby('name').sum()
# Sort by the score
sorted_karma = grouped_karma.sort_values('score', ascending=False).reset_index()
# Clean up our table
sorted_karma = sorted_karma.rename(columns={'name':'user', 'score':'karma'})
sorted_karma = sorted_karma.drop(columns=['created_utc'], axis=1)
display(sorted_karma.head(11))
# Truncate to remove None and get the first 50 to plot
most_karma_trunc = sorted_karma.truncate(before=1).head(50)
most_karma_fig = go.Figure(
data=[go.Bar(x=most_karma_trunc['user'], y=most_karma_trunc['karma'])],
layout_title_text="Users with the Most Karma Accrued in r/UMD"
)
most_karma_fig.update_xaxes(title_text='User')
most_karma_fig.update_yaxes(title_text='Amount of Karma')
most_karma_fig.show()
As is evident from the tables, the top entry for verbosity, posts, comments, and karma is "None," which is not a single user but the collection of accounts that have since been deleted, causing them to be listed as "None" in the dataframes. For the plots, we have therefore excluded "None" by truncating the dataframes to drop their first index.
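If deleted authors are stored as missing values rather than the literal string "None" (an assumption; it depends on how the data was ingested), they can also be dropped before counting:

```python
import pandas as pd

# Missing author names lump together into one giant pseudo-user if kept;
# dropping them before counting avoids that
authors = pd.Series(['alice', None, 'bob', None, 'alice'])
counts = authors.dropna().value_counts()
print(counts.to_dict())  # → {'alice': 2, 'bob': 1}
```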
Notably, u/AutoModerator is ranked highest for number of posts, as it is a bot that makes weekly "This Week At UMD" posts. Ranked fourth in verboseness, second in posts, and eleventh in comments is u/umdit, the official Reddit account for UMD's IT department, which regularly posts about and responds to IT issues. u/UMD_DOTS also appears in the top 50 most verbose users for a similar reason: posts and comments about transportation issues.
It is also easy to see that many of the same users appear in each of the above plots (for instance, u/turtle_stank, u/MovkeyB, u/Miseryy, u/worldchrisis, and u/uldu, to name a few). This makes sense: to achieve a high verboseness score, a user needs to make a significant number of posts and comments, which in turn usually leads to accumulating a large amount of karma.
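One way to quantify this overlap would be a rank correlation between the per-user metrics; a sketch on toy data (the column names are assumed to match our most_verbose and most_posts dataframes):

```python
import pandas as pd

# Toy per-user metrics; in the real data these would come from merging
# most_verbose and most_posts on the user column
metrics = pd.DataFrame({
    'user': ['u1', 'u2', 'u3', 'u4'],
    'num_words': [9000, 4000, 2500, 100],
    'num_posts': [80, 60, 30, 2],
})
# Spearman correlation compares rankings rather than raw magnitudes
rho = metrics['num_words'].corr(metrics['num_posts'], method='spearman')
print(rho)  # 1.0 here: the toy rankings agree perfectly
```

A Spearman coefficient near 1 on the real data would confirm that word-count and post-count rankings largely coincide.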
# Get the users with the least total karma from all their posts and comments on r/UMD
# Sort by the score, least to most
sorted_karma = grouped_karma.sort_values('score', ascending=True).reset_index()
# Clean up our table
sorted_karma = sorted_karma.rename(columns={'name':'user', 'score':'karma'})
sorted_karma = sorted_karma.drop(columns=['created_utc'], axis=1)
display(sorted_karma.head(11))
# Get the first 50 to plot (no truncation needed here, since "None" is not at the bottom)
least_karma_trunc = sorted_karma.head(50)
least_karma_fig = go.Figure(
data=[go.Bar(x=least_karma_trunc['user'], y=least_karma_trunc['karma'])],
layout_title_text="Users with the Lowest Karma Accrued in r/UMD"
)
least_karma_fig.update_xaxes(title_text='User')
least_karma_fig.update_yaxes(title_text='Amount of Karma')
least_karma_fig.show()
Many of these users are likely trolls; however, u/VectorMarketingRep is a representative of the multi-level marketing scam known as Vector Marketing. Judging by the immensely negative karma that u/VectorMarketingRep has accrued, it is safe to say that most users of r/UMD are aware of the pyramid scheme that Vector Marketing has been running.
# Verboseness ranking: returns [rank, number of words]
def rank_verbose(user):
    for index, row in most_verbose.iterrows():
        # case-insensitive string comparison because nobody remembers capitalization
        if row['user'].lower() == user.lower():
            return [index, row['num_words']]
    # If the user doesn't exist in r/UMD:
    return [-1, -1]
# Posting ranking: returns [rank, number of posts]
def rank_posts(user):
    for index, row in most_posts.iterrows():
        if row['user'].lower() == user.lower():
            return [index, row['num_posts']]
    return [-1, -1]
# Commenting ranking: returns [rank, number of comments]
def rank_comments(user):
    for index, row in most_comments.iterrows():
        if row['user'].lower() == user.lower():
            return [index, row['num_comments']]
    return [-1, -1]
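Each of these ranking functions scans the entire dataframe on every call. If we needed many lookups, a dictionary keyed by lowercase username would give constant-time lookups instead; a sketch on a toy stand-in for most_verbose:

```python
import pandas as pd

# Toy stand-in for most_verbose: already sorted by num_words descending
most_verbose_demo = pd.DataFrame({'user': ['Alice', 'Bob'], 'num_words': [500, 20]})

# Precompute a lowercase-username -> (rank, num_words) map once
verbose_lookup = {
    row['user'].lower(): (index, row['num_words'])
    for index, row in most_verbose_demo.iterrows()
}
print(verbose_lookup.get('alice', (-1, -1)))   # → (0, 500)
print(verbose_lookup.get('nobody', (-1, -1)))  # → (-1, -1)
```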
Next, we will build a function that summarizes the r/UMD activity of any given user. It returns a dictionary with the following keys:
'user' : The user's username
'num_posts' : Total number of posts the user has made in r/UMD
'posts_rank' : Their ranking in terms of their number of posts made in r/UMD as compared to all other users of r/UMD
'num_comments' : Total number of comments the user has made in r/UMD
'comments_rank' : Their ranking in terms of their number of comments made in r/UMD as compared to all other users of r/UMD
'num_words' : Total number of words written in r/UMD by the user
'words_rank' : Their ranking in terms of the number of words they have written in r/UMD as compared to all other users of r/UMD
'first_post_date_utc' : The time of the first post the user made in r/UMD in seconds since 1970
'first_post_date' : The date and time of the first post the user made in r/UMD, presented as a string
'first_post_title' : The title of the user's first post in r/UMD
'first_post_url' : The URL linking to the user's first post in r/UMD
'umd_post_karma' : The total amount of karma the user has accumulated in r/UMD alone from posts
'pop_post_karma' : The greatest amount of karma the user has ever received on a single post to r/UMD
'pop_post_title' : The title of the user's post that received the most karma out of all their posts to r/UMD
'pop_post_url' : The URL linking to the user's post that received the most karma out of all their posts to r/UMD
'worst_post_karma' : The least amount of karma the user has ever received on a single post to r/UMD
'worst_post_title' : The title of the user's post that received the least karma out of all their posts to r/UMD
'worst_post_url' : The URL linking to the user's post that received the least karma out of all their posts to r/UMD
'first_comment_date_utc' : The creation time of the first comment that the user ever made on a post in r/UMD in seconds since 1970
'first_comment_date' : The date and time of the first comment the user ever made to a post in r/UMD, presented as a string
'first_comment_body' : The text contained within the first comment the user ever made to a post in r/UMD
'umd_comment_karma' : The total amount of karma the user has received from comments in r/UMD alone
'pop_comment_karma' : The greatest amount of karma the user has ever received from a single comment in r/UMD
'pop_comment_body' : The text contained within the user's comment that received the most karma out of all their comments in r/UMD
'worst_comment_karma' : The least amount of karma the user has ever received from a comment in r/UMD
'worst_comment_body' : The text contained within the user's comment that received the least karma out of all their comments in r/UMD
'total_umd_karma' : The total amount of karma the user has received from all their posts and comments in r/UMD
'favorite_word' : The word that appears most frequently in the user's post titles, post descriptions, and comments in r/UMD
'favorite_adj' : The adjective that appears most frequently in the user's post titles, post descriptions, and comments in r/UMD
'favorite_verb' : The verb that appears most frequently in the user's post titles, post descriptions, and comments in r/UMD
'favorite_noun' : The noun that appears most frequently in the user's post titles, post descriptions, and comments in r/UMD
'sentiment' : Gives a dictionary containing the percentage of the user's posts and comments classified as having each sentiment. This dictionary's keys are 'positive', 'neutral', and 'negative'.
pos_noun = ['NN', 'NNS', 'NNP', 'NNPS']
pos_adj = ['JJ', 'JJR', 'JJS']
pos_verb = ['VB', 'VBD', 'VBG', 'VBP', 'VBZ']
# Function that returns a dictionary summarizing the r/UMD activity of a particular user
def analyze_user(user):
    verbose = rank_verbose(user)
    if verbose[0] != -1:
        posts = rank_posts(user)
        comments = rank_comments(user)
        # Get more data on the user:
        # Get some info on their posts
        first_post_date = None
        first_post_title = "NA"
        first_post_url = "NA"
        first_post_karma = 0
        post_karma = 0
        pop_post_karma = 0
        pop_post_title = "NA"
        pop_post_url = "NA"
        hated_post_karma = 0
        hated_post_title = "NA"
        hated_post_url = "NA"
        # The sentiment dictionary will temporarily store just the counts for each sentiment
        sentiment = {'positive': 0, 'neutral': 0, 'negative': 0}
        for index, row in df_post[df_post['name'].str.lower() == user.lower()].iterrows():
            # Initialize with the first row we see
            if first_post_date is None:
                first_post_date = row['created_utc']
                first_post_title = row['title']
                first_post_url = row['url']
                pop_post_karma = row['score']
                hated_post_karma = row['score']
                first_post_karma = row['score']
            elif row['created_utc'] < first_post_date:
                first_post_date = row['created_utc']
                first_post_title = row['title']
                first_post_url = row['url']
                first_post_karma = row['score']
            post_karma += row['score']
            if row['score'] >= pop_post_karma:
                pop_post_karma = row['score']
                pop_post_title = row['title']
                pop_post_url = row['url']
            if row['score'] <= hated_post_karma:
                hated_post_karma = row['score']
                hated_post_title = row['title']
                hated_post_url = row['url']
            sentiment[row['sentiment']] += 1
        # Get some info on their comments
        first_comment_date = None
        first_comment_body = "NA"
        comment_karma = 0
        pop_comment_karma = 0
        pop_comment_body = "NA"
        hated_comment_karma = 0
        hated_comment_body = "NA"
        for index, row in df_comment[df_comment['name'].str.lower() == user.lower()].iterrows():
            # Initialize with the first row we see
            if first_comment_date is None:
                first_comment_date = row['created_utc']
                first_comment_body = row['body']
                pop_comment_karma = row['score']
                hated_comment_karma = row['score']
            elif row['created_utc'] < first_comment_date:
                first_comment_date = row['created_utc']
                first_comment_body = row['body']
            comment_karma += row['score']
            if row['score'] >= pop_comment_karma:
                pop_comment_karma = row['score']
                pop_comment_body = row['body']
            if row['score'] <= hated_comment_karma:
                hated_comment_karma = row['score']
                hated_comment_body = row['body']
            sentiment[row['sentiment']] += 1
        total_karma = post_karma + comment_karma
        # Find which word the user posts most often
        favorite_word = pd.DataFrame(words_frame[words_frame['user'].str.lower() == user.lower()]['word'].value_counts()).reset_index().at[0, 'index']
        # Repeat this, but for each part of speech
        favorite_noun = pd.DataFrame(words_frame[(words_frame['user'].str.lower() == user.lower()) &
                                                 (words_frame['pos'].isin(pos_noun))]['word'].value_counts()).reset_index().at[0, 'index']
        favorite_adj = pd.DataFrame(words_frame[(words_frame['user'].str.lower() == user.lower()) &
                                                (words_frame['pos'].isin(pos_adj))]['word'].value_counts()).reset_index().at[0, 'index']
        favorite_verb = pd.DataFrame(words_frame[(words_frame['user'].str.lower() == user.lower()) &
                                                 (words_frame['pos'].isin(pos_verb))]['word'].value_counts()).reset_index().at[0, 'index']
        # Calculate the percentage of posts and comments with each sentiment
        sentiment['positive'] = 100 * (sentiment['positive'] / (posts[1] + comments[1]))
        sentiment['neutral'] = 100 * (sentiment['neutral'] / (posts[1] + comments[1]))
        sentiment['negative'] = 100 * (sentiment['negative'] / (posts[1] + comments[1]))
        # Create the final dictionary
        results = {'user': user, 'num_posts': posts[1], 'posts_rank': posts[0],
                   'num_comments': comments[1], 'comments_rank': comments[0],
                   'num_words': verbose[1], 'words_rank': verbose[0],
                   'first_post_date_utc': first_post_date,
                   'first_post_date': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(first_post_date)),
                   'first_post_title': first_post_title, 'first_post_url': first_post_url,
                   'first_post_karma': first_post_karma,
                   'umd_post_karma': post_karma, 'pop_post_karma': pop_post_karma, 'pop_post_title': pop_post_title,
                   'pop_post_url': pop_post_url, 'worst_post_title': hated_post_title, 'worst_post_karma': hated_post_karma,
                   'worst_post_url': hated_post_url, 'first_comment_date_utc': first_comment_date,
                   'first_comment_date': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(first_comment_date)),
                   'first_comment_body': first_comment_body, 'umd_comment_karma': comment_karma,
                   'pop_comment_karma': pop_comment_karma, 'pop_comment_body': pop_comment_body,
                   'worst_comment_karma': hated_comment_karma,
                   'worst_comment_body': hated_comment_body, 'total_umd_karma': total_karma, 'favorite_word': favorite_word,
                   'favorite_adj': favorite_adj, 'favorite_verb': favorite_verb, 'favorite_noun': favorite_noun,
                   'sentiment': sentiment}
        return results
    else:
        return None
Now, we can view data for any Reddit user who is part of r/UMD. For example:
vorsteg = analyze_user('vorstegasauras')
dots = analyze_user('UMD_DOTS')
dickerson = analyze_user('ProfJohnDickerson')
miseryy = analyze_user('Miseryy')
print('u/Vorstegasauras\'s most popular post is ', vorsteg['pop_post_title'], 'and you can view it at',
vorsteg['pop_post_url'] + '.')
print('u/UMD_DOTS\' favorite noun is \'' + dots['favorite_noun'] + '.\'')
print('u/ProfJohnDickerson\'s first comment was made', dickerson['first_comment_date'] + '.')
print(str(miseryy['sentiment']['positive']) + '% of u/Miseryy\'s posts and comments have a positive sentiment.')
# Create the plot containing the top words of all types
all_word_counts = words_frame['word'].value_counts()
all_words_fig = go.Figure(
    data=[go.Bar(x=all_word_counts.head(50).index, y=all_word_counts.head(50))],
    layout_title_text="Words with Most Occurrences in r/UMD"
)
all_words_fig.update_xaxes(title_text='Word')
all_words_fig.update_yaxes(title_text='Number of Occurrences in r/UMD')
all_words_fig.show()
While the above plot shows the most popular words in r/UMD, "UMD" itself is the only word specific to University of Maryland (or even higher education in general) that appears in this plot. The rest are all quite generic, mostly consisting of articles, prepositions, and pronouns. We will address this, but first, let's take a look at a pattern that arises in the word frequencies themselves.
According to Zipf's Law, the number of occurrences of a word in nearly any body of text is inversely proportional to its rank.
Given that the word p with rank(p) = 1 is known to have occ(p) occurrences, a word w is predicted to have:
occ(w) ≈ occ(p) * (1 / rank(w))
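For intuition, here is a tiny worked example of this prediction with made-up counts (the numbers are hypothetical, not taken from r/UMD):

```python
# Hypothetical count for the rank-1 word (not a real r/UMD figure)
top_count = 60000

# Zipf's law prediction: occ(w) ≈ occ(p) / rank(w)
def zipf_prediction(rank, top_count=top_count):
    return top_count / rank

predictions = [zipf_prediction(r) for r in range(1, 6)]
# The rank-2 word is predicted to occur half as often as the rank-1 word,
# the rank-3 word one third as often, and so on.
print(predictions)  # [60000.0, 30000.0, 20000.0, 15000.0, 12000.0]
```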
However, it appears that r/UMD may not actually follow Zipf's law. Looking at the plot above, the second most popular word ("to") should have approximately half as many occurrences as the most popular word ("the"), the third most popular word ("I") approximately one third as many, and so on. We can easily see from the plot that this is not the case: each of the next five words after "the" has at least half as many occurrences as "the," far more than Zipf's law would predict.
Let's investigate this on a larger scale. Using Zipf's law, we will compute the predicted number of occurrences for each word based on the number of occurrences of the most popular word and each word's ranking. Then, we will visualize how closely r/UMD follows Zipf's law with a plot of the Zipf's law predictions versus the actual number of occurrences. If it follows Zipf's law closely, the slope of a linear regression line on the plot will be approximately 1.
# Create a dataframe from the series containing the value counts for each word, renaming the columns appropriately
all_word_counts_frame = pd.DataFrame(all_word_counts).reset_index()
# Note: rename returns a new dataframe, so we must assign the result back
all_word_counts_frame = all_word_counts_frame.rename(columns={'index': 'word', 'word': 'actual_count'})
# Add a column for the predicted occurrence value according to Zipf's law
# (initialized with 0.0 so the column holds floats rather than truncated integers)
all_word_counts_frame.insert(2, 'zipf_count', 0.0)
# Get the count of the number of occurrences of the most popular word
first_count = all_word_counts_frame.iat[0, 1]
for index, row in all_word_counts_frame.iterrows():
    # A word at positional index i has rank i + 1, so Zipf's law predicts first_count / (i + 1) occurrences
    all_word_counts_frame.at[index, 'zipf_count'] = first_count * (1 / (1 + index))
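Incidentally, the row-by-row loop above can also be written as a single vectorized pandas expression, which is much faster for a large vocabulary. A sketch of that alternative on a toy word-count series (the words and counts here are hypothetical):

```python
import pandas as pd

# Toy word counts standing in for all_word_counts (hypothetical values)
counts = pd.Series([100, 70, 50, 40], index=['the', 'to', 'i', 'a'], name='actual_count')
frame = counts.reset_index().rename(columns={'index': 'word'})

# A word at positional index i has rank i + 1, so Zipf's law predicts
# first_count / (i + 1) occurrences -- computed here without an explicit loop
first_count = frame['actual_count'].iat[0]
frame['zipf_count'] = first_count / (frame.index + 1)

print(frame)
```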
# Create the plot comparing the predicted number of occurrences to the actual number of occurrences for each word.
zipf_fig = px.scatter(
    # Use .head(1500) to limit our plot to the first 1500 words. Any more than that slows down the notebook too much when viewing.
    x=all_word_counts_frame[all_word_counts_frame.columns[2]].head(1500),
    y=all_word_counts_frame[all_word_counts_frame.columns[1]].head(1500),
    trendline='ols',
    title="Actual Number of Occurrences vs. Expected Number of Occurrences for Words in r/UMD"
)
zipf_fig.update_yaxes(title_text='Actual Number of Occurrences')
zipf_fig.update_xaxes(title_text='Expected Number of Occurrences based on Zipf\'s Law')
zipf_fig.show()
The slope of the regression line in the plot above is 1.476779, which is significantly higher than 1. Looking at the plot, we can see that the slope would be even higher than that if not for the word with rank 1 (whose expected number of occurrences is, by construction, equal to its actual number of occurrences). Thus, it is clear that the actual number of occurrences for each word tends to be much higher than the prediction based on Zipf's law, as was seen in the previous bar plot.
Thus, without the inclusion of some additional coefficient(s) in the calculation of the expected number of occurrences, r/UMD does not appear to follow Zipf's law very closely. It's impossible to be certain about the reason for this, but it may be because r/UMD is an internet forum filled with abbreviations and misspellings, some of them intentional. Grammar is mostly optional in such forums, which can flatten the frequency distribution: words that would otherwise dominate end up occurring only a bit more often than the next most popular word.
We previously defined pos_noun, pos_adj, and pos_verb, each of which contains the codes defined by NLTK that are associated with certain parts-of-speech (nouns, adjectives, and verbs). We will use these in the following plots to see what the most popular words of each category are, and hopefully find some more popular words commonly associated with University of Maryland.
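As a quick illustration of how these lists bucket NLTK's Penn Treebank tags, here is a hand-tagged toy sentence in the (word, tag) format that nltk.pos_tag produces (the tags are written out by hand here rather than produced by running the tagger):

```python
pos_noun = ['NN', 'NNS', 'NNP', 'NNPS']
pos_adj = ['JJ', 'JJR', 'JJS']
pos_verb = ['VB', 'VBD', 'VBG', 'VBP', 'VBZ']

# Hand-tagged tokens in NLTK's (word, tag) format -- what nltk.pos_tag would return
tagged = [('the', 'DT'), ('easy', 'JJ'), ('class', 'NN'), ('helped', 'VBD'), ('students', 'NNS')]

# Bucket each word by whether its tag falls into one of the part-of-speech lists
nouns = [w for w, tag in tagged if tag in pos_noun]
adjs = [w for w, tag in tagged if tag in pos_adj]
verbs = [w for w, tag in tagged if tag in pos_verb]

print(nouns, adjs, verbs)  # ['class', 'students'] ['easy'] ['helped']
```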
# Create the plot containing the top nouns
noun_counts = words_frame[words_frame['pos'].isin(pos_noun)]['word'].value_counts().head(50)
noun_fig = go.Figure(
    data=[go.Bar(x=noun_counts.index, y=noun_counts)],
    layout_title_text="Nouns with Most Occurrences in r/UMD"
)
noun_fig.update_xaxes(title_text='Noun')
noun_fig.update_yaxes(title_text='Number of Occurrences in r/UMD')
noun_fig.show()
From the above plot of the most popular nouns of r/UMD, several words immediately stand out as being specific to University of Maryland, the culture of the subreddit, and higher education in general. Some of these include "class," "UMD," "semester," "campus," "students," "CS," "course," "college," "professor," "room," "math," and "program." Immediately, we can see that this plot is more relevant than the plot with all words included.
# Create the plot containing the top adjectives
adj_counts = words_frame[words_frame['pos'].isin(pos_adj)]['word'].value_counts().head(50)
adj_fig = go.Figure(
    data=[go.Bar(x=adj_counts.index, y=adj_counts)],
    layout_title_text="Adjectives with Most Occurrences in r/UMD"
)
adj_fig.update_xaxes(title_text='Adjective')
adj_fig.update_yaxes(title_text='Number of Occurrences in r/UMD')
adj_fig.show()
While the above plot of the most popular adjectives in r/UMD doesn't have as many stand-out UMD-related words at first glance, there are still several words classified as adjectives that are certainly relevant. For example, "major," "umd," and "final" are obvious instances of this. Less obviously, words such as "easy," "difficult," and "hard" are commonly used to describe classes or assignments.
We can also begin to see some of the limitations of our tokenization and NLTK's part-of-speech tagging in this plot, with "words" such as "t" and "it's" being classified as adjectives.
On a lighter note, it is reassuring to see that "good" is the most popular adjective, being approximately three times as popular as "bad." It's always nice to have a positive attitude.
# Create the plot containing the top verbs
verb_counts = words_frame[words_frame['pos'].isin(pos_verb)]['word'].value_counts().head(50)
verb_fig = go.Figure(
    data=[go.Bar(x=verb_counts.index, y=verb_counts)],
    layout_title_text="Verbs with Most Occurrences in r/UMD"
)
verb_fig.update_xaxes(title_text='Verb')
verb_fig.update_yaxes(title_text='Number of Occurrences in r/UMD')
verb_fig.show()
The plot of the most popular verbs of r/UMD is much less obviously specific to r/UMD. However, there are certain words that make a lot of sense to appear often in a college forum. For instance, "taking," "take," and "took" (as in "to take a class"), were very popular, as were "work" and "help."
However, we can see some additional issues that have evidently arisen from our tokenization and parts-of-speech classification, specifically in the words "s" and "m."
On a side note, the above graph displays behavior a bit more in line with what we might expect from Zipf's law, in that the second most popular word is about half as popular as the first.
Overall, of these word-occurrence plots, the plot of the most popular nouns appears to be the most useful due to the quantity of nouns that can easily be directly associated with University of Maryland that appeared in the plot.
'occurrences' : Total number of occurrences of the word in r/UMD
'occurrences_posts' : Total number of occurrences of the word in titles of posts to r/UMD
'occurrences_descriptions' : Total number of occurrences of the word in descriptions of posts to r/UMD
'occurrences_comments' : Total number of occurrences of the word in comments posted to r/UMD
'pop_source' : The most common source (title of post, post description, or comment) of the word in r/UMD
'first_occurrence_date' : The date and time that the word first appeared on r/UMD
'first_user' : The username of the user who first posted something containing the word to r/UMD
'fave_user' : The username of the user that has used this word the most out of all users in r/UMD
'fave_user_count' : The number of times the user that used this word the most has said the word
'rank_word' : The ranking in popularity for this word in r/UMD
'sentiment' : Gives a dictionary containing the percentages of the occurrences that occurred within posts or comments with certain sentiments. The keys for this dictionary are 'positive', 'neutral', and 'negative'.
# Function that returns a dictionary summarizing the data about a particular word's usage in r/UMD
def analyze_word(query_word):
    # Find the number of occurrences of the word and the word's popularity ranking within r/UMD
    rank_word = -1
    occurrences = 0
    occurrences_posts = 0
    occurrences_desc = 0
    occurrences_comments = 0
    pop_source = 'NA'
    first_user = 'NA'
    fave_user = 'NA'
    fave_user_count = -1
    first_occurrence_date = -1
    sentiment = {'positive': 0, 'neutral': 0, 'negative': 0}
    for index, row in pd.DataFrame(words_frame['word'].value_counts()).reset_index().iterrows():
        if row['index'].lower() == query_word.lower():
            if rank_word < 0:
                rank_word = 1 + index
            occurrences += row['word']
    if rank_word != -1:
        # Filter out all the other words that aren't relevant
        query_words_frame = words_frame[words_frame['word'].str.lower() == query_word.lower()]
        # Make a dataframe with the counts for the number of times the word came from a specific source
        query_words_source_frame = pd.DataFrame(query_words_frame['source'].value_counts()).reset_index()
        # Find the number of occurrences of the word from each source
        for index, row in query_words_source_frame.iterrows():
            if row[0] == 'title':
                occurrences_posts = row['source']
            elif row[0] == 'description':
                occurrences_desc = row['source']
            else:
                occurrences_comments = row['source']
        # Determine the most popular source of the word
        pop_source = query_words_source_frame.iat[0, 0]
        # Find the date of the first occurrence of the word, the first user to say it,
        # and count the number of times the word appeared in text classified as having each sentiment
        for index, row in query_words_frame.iterrows():
            if first_occurrence_date == -1 or first_occurrence_date > row['date']:
                first_occurrence_date = row['date']
                first_user = row['user']
            # Increment the sentiment count for the type of sentiment the word's associated text has
            sentiment[row['sentiment']] += 1
        # Determine which user has said the word the most and how many times they've said it
        users_counts = pd.DataFrame(query_words_frame['user'].value_counts()).reset_index()
        if users_counts.at[0, 'index'] != 'None':
            fave_user = users_counts.iat[0, 0]
            fave_user_count = users_counts.iat[0, 1]
        else:
            fave_user = users_counts.iat[1, 0]
            fave_user_count = users_counts.iat[1, 1]
        # Convert the sentiment counts to percentages
        sentiment['positive'] = 100 * (sentiment['positive'] / occurrences)
        sentiment['neutral'] = 100 * (sentiment['neutral'] / occurrences)
        sentiment['negative'] = 100 * (sentiment['negative'] / occurrences)
    # Create the dictionary of results
    results = {'rank_word': rank_word, 'occurrences': occurrences, 'occurrences_posts': occurrences_posts,
               'occurrences_descriptions': occurrences_desc, 'occurrences_comments': occurrences_comments,
               'pop_source': pop_source, 'first_occurrence_date_utc': first_occurrence_date,
               'first_occurrence_date': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(first_occurrence_date)),
               'first_user': first_user, 'fave_user': fave_user, 'fave_user_count': fave_user_count,
               'sentiment': sentiment}
    return results
Now we can find data on any word's presence in r/UMD:
umd = analyze_word('umd')
maryland = analyze_word('maryland')
loh = analyze_word('loh')
tastings = analyze_word('tastings')
tendies = analyze_word('tendies')
dickerson = analyze_word('Dickerson')
print('\"UMD\" is ranked', umd['rank_word'], 'in terms of popularity, with', umd['occurrences'], 'occurrences.')
print('\"Maryland\" is ranked', maryland['rank_word'], 'in terms of popularity, with', maryland['occurrences'], 'occurrences.')
print('\"Loh\" was first mentioned', loh['first_occurrence_date'], 'by u/' + loh['first_user'] + '.')
print('\"Tastings\" has been mentioned the most by u/' + tastings['fave_user'] + ',', tastings['fave_user_count'], 'times.')
print('\"Tendies\" have been mentioned', tendies['occurrences'], 'times since their first mention',
tendies['first_occurrence_date'] + '.')
print(str(dickerson['sentiment']['positive']) + '% of the mentions of \"Dickerson\" in r/UMD have had a positive sentiment.')
We'll now define two functions: one that displays a graph of any word's frequency of occurrences over time, and another that displays a graph of any word's sentiment over time. The sentiment graphs will feature a LOWESS (Locally Weighted Scatterplot Smoothing) curve to make it easier to visualize the sentiment over time.
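Under the hood, Plotly Express fits trendline="lowess" curves with statsmodels. A minimal sketch of what that smoothing does, run here on synthetic noisy data rather than our sentiment scores:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic noisy data standing in for the per-mention sentiment values
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# lowess fits a weighted local regression around each point;
# frac controls what fraction of the data each local fit uses
smoothed = lowess(y, x, frac=0.3)  # (n, 2) array of (x, smoothed y) pairs

print(smoothed.shape)
```

Larger frac values produce a flatter, more global curve; smaller values track local fluctuations more closely.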
def word_time_plot(query_word):
    # Filter out all words except for the query word
    query_frame = words_frame[words_frame['word'] == query_word.lower()]
    # Get the number of times the word was posted each day
    date_counts = query_frame['date_ymd'].value_counts()
    # Make the plot
    query_fig = go.Figure(
        # For x, we convert from Unix time to datetime objects so that the plot is meaningful
        # (we don't want to display our data in terms of seconds since 1970)
        data=[go.Bar(x=pd.DataFrame(date_counts).reset_index()['index'].apply(lambda x: dt.datetime.fromtimestamp(x)),
                     y=date_counts)],
        # Add a title
        layout_title_text='Occurrences of \"' + str(query_word) + '\" vs. Time',
        # Change the plot background color to black
        # (the default light gray background makes the bars nearly invisible when zoomed out)
        layout_plot_bgcolor='rgb(0,0,0)'
    )
    # Label the axes
    query_fig.update_xaxes(title_text='Time')
    query_fig.update_yaxes(title_text='Occurrences')
    return query_fig
def sentiment_time_plot(query_word):
    # Filter out all words except for the query word
    query_frame = words_frame[words_frame['word'] == query_word.lower()]
    # Make the plot
    query_fig = px.scatter(
        # For x, we convert from Unix time to datetime objects so that the plot is meaningful
        # (we don't want to display our data in terms of seconds since 1970)
        x=query_frame['date'].apply(lambda x: dt.datetime.fromtimestamp(x)),
        # Map sentiments to the numeric values used by the tick labels below:
        # positive -> 1, negative -> -1, neutral -> 0
        y=query_frame['sentiment'].apply(lambda x: 1 if x == 'positive' else -1 if x == 'negative' else 0),
        # Add a LOWESS trendline
        trendline="lowess",
        # Add a title
        title='Sentiment of \"' + str(query_word) + '\" vs. Time',
    )
    # Label the axes
    query_fig.update_xaxes(title_text='Time')
    query_fig.update_yaxes(title_text='Sentiment',
                           ticktext=["Negative", "Neutral", "Positive"],
                           tickvals=[-1, 0, 1])
    return query_fig
Using these functions, we can see the occurrences and sentiment of any word over time. Let's take a look at a few examples.
word_time_plot('snow').show()
As is evident from the plot, there tend to be more mentions of "snow" during the winter months of each year. This is unsurprising and uninteresting, but it is a good indicator that our word_time_plot function works.
Let's take a look at the sentiment of the posts and comments containing "snow" over time.
sentiment_time_plot('snow').show()
From the LOWESS curve, it appears that the sentiment associated with "snow" tends to be more positive than negative. Given that adequate snowfall can result in cancelled classes, this is unsurprising. It also appears from the relatively flat curve that the general sentiment concerning snow has not changed much year-to-year.
This has been a controversial topic ever since University of Maryland switched from UMD-Secure to Eduroam as the primary campus Wi-Fi network.
# Produce graph of Eduroam's appearances over time
word_time_plot('Eduroam').show()
# Find out which user has mentioned Eduroam the most
eduroam_analysis = analyze_word('Eduroam')
print('u/' + eduroam_analysis['fave_user'], 'has mentioned Eduroam the most out of everyone on r/UMD, with a total of',
eduroam_analysis['fave_user_count'], 'mentions.')
As can be seen in the above graph, before August of 2019, there were very few mentions of Eduroam. Beginning in August 2019 and continuing into the following months, however, "Eduroam" began to appear much more frequently, peaking at 31 occurrences on September 19, 2019. This rise in occurrences coincides with the start of the fall 2019 semester, the first semester in which Eduroam fully replaced UMD-Secure. Many users of Eduroam reported having more connectivity problems than they had with UMD-Secure, and such problems were discussed on r/UMD. Often, u/umdit, the official Reddit account for UMD's IT department, would join discussions of the issues, helping to troubleshoot and addressing misconceptions (for example: "Eduroam uses the same infrastructure as umd-secure. We know you're having issues, let us help!"). Accordingly, u/umdit is the user that has mentioned "Eduroam" the most in r/UMD, a total of 93 times, as calculated by the analyze_word function we defined earlier.
The large spike in mentions of "Eduroam" occurring around September 19, 2019, is likely partially a result of the protest that occurred on September 17, 2019. In the following days, there were several posts that mocked the protesters' signs by editing photos of them so that they protested Eduroam, which then likely contributed to a new wave of posts complaining about the quality of Eduroam (and likely inspired posts such as this).
Let's now look at how the sentiment of the posts and comments containing "Eduroam" changes over time.
sentiment_time_plot('Eduroam').show()
The LOWESS curve shows us that r/UMD's opinion of Eduroam began dropping dramatically since the beginning of the fall 2019 semester. This makes sense given the many posts about problems with Eduroam that have appeared in r/UMD since it became the primary campus Wi-Fi network for the fall of 2019, as previously discussed.
Coach Durkin was embroiled in controversy following the death of UMD football player Jordan McNair in June 2018.
word_time_plot('Durkin').show()
There are several things to note about this plot of the occurrences of "Durkin." First, in the far left portion of the plot, there are two occurrences in December of 2015. This coincides with the initial hire of DJ Durkin as head coach of University of Maryland's football team, which was first reported on December 2, 2015.
The next relatively large cluster of "Durkin" mentions occurred between August 11 and August 19, 2018. This coincides with the time that Durkin was first placed on leave from his position as head coach of the football team, and there was much discussion on r/UMD about this. Interestingly enough, this is the first occasion that Durkin was mentioned in r/UMD after Jordan McNair's death on June 13, 2018, likely indicating that users of r/UMD did not initially consider Durkin to be at fault when news of McNair's death was first spread.
Finally, the largest spike occurred between October 29, 2018 and November 2, 2018. On October 30, the Board of Regents reinstated Durkin as head coach of the football team. On October 31, the very next day, President Loh fired DJ Durkin. Following that, on November 1, a fight broke out at a UMD football practice. All of these events were discussed on r/UMD as they were reported:
Durkin reinstated: "After Maryland Player’s Death, Coach and Athletic Director Keep Their Jobs - The New York Times"
Durkin fired: "Durkin Fired"
Football Fight: "Fight breaks out among Maryland football players at practice in wake of Durkin drama"
We'll now plot the appearances of "Loh" (as in President Wallace Loh) over time to see if his name shows spikes similar to Durkin's.
word_time_plot('Loh').show()
Loh has clearly been mentioned far more overall than DJ Durkin (which makes sense given that he is the president of the university); however, Loh shares the same spikes in mentions as Durkin in August 2018 and late October/early November of 2018. This makes sense, as Loh was the one to ultimately fire Durkin, and Loh's retirement was announced on the same day that Durkin was initially reinstated (October 30, 2018).
Let's now take a brief look at the sentiment for Durkin and Loh over time, starting with Durkin.
sentiment_time_plot('Durkin').show()
Although there were very few mentions of Durkin before August 2018, there is a noticeable drop in sentiment level that takes place around August to November 2018. This makes sense given that this is the timeframe during which Durkin was placed on leave, reinstated, and then fired.
sentiment_time_plot('Loh').show()
The sentiment of posts and comments containing mentions of "Loh" appears to have stayed approximately the same throughout r/UMD's existence, as the LOWESS curve has a nearly flat slope. Being the president of the university, Loh has always had and always will have supporters and critics. If we look closely, however, the slope of the trendline is ever-so-slightly negative, which may be partially a result of the controversy following McNair's death. The trendline overall appears to be much more positive than negative, but this may be attributable to the VADER SentimentIntensityAnalyzer misclassifying sarcastic posts and comments as positive ones.
word_time_plot('Penn').show()
In examining the above plot, we can see that there have been various mentions of "Penn" (likely referring to Penn State) over the years, typically popping up around when UMD and Penn State played each other in football (for example, October 24, 2015, October 8, 2016, etc.). However, in 2019, there was a major spike in mentions of "Penn" in September. This coincides with the major Maryland vs. Penn State home game that took place on Friday, September 27. The University had said that classes would not meet in person during the afternoon of the game, and in the weeks leading up to it, r/UMD was inundated with posts about obtaining and selling tickets to the game (for example, "Anyone wanna sell me a Penn State ticket", and "NEED PENN STATE TICKET TRYING TO BUY ONE BEFORE TODAY ENDS").
Following Maryland's 59-0 defeat during the game, several users posted memes about the loss: "UMD vs Penn State: A Halftime Report", "UMD Cheerleaders when a touchdown to put Penn State up 45 gets called back and we are only down 38"
Let's take a look at the sentiment of posts and comments mentioning "Penn" over time.
sentiment_time_plot('Penn').show()
This negatively-sloping LOWESS curve can easily be explained by an increase in posts insulting Penn State surrounding sporting events. For instance, see this post, as well as this post. Again, the trendline overall seems to be more positive than negative, which may result from the influence of sarcasm.
"Iribe" refers to Brendan Iribe, co-founder of Oculus VR and the namesake of the Iribe Center for Computer Science and Engineering.
word_time_plot('Iribe').show()
It is immediately clear from this plot that mentions of "Iribe" have dramatically increased in frequency since the first mention in 2014. This corresponds with the completion of the construction of the Iribe Center. Mentions of Iribe began to increase dramatically throughout the first half of 2019, when portions of the building first began opening. This is also when the grand opening of the Iribe Center occurred, attended by Brendan Iribe himself on April 27, 2019. Following May 31, there are almost no mentions of Iribe until mid August (which corresponds to summer break, when relatively few people were on campus). Following that, a major influx of Iribe mentions came during the fall 2019 semester as the building opened completely to classes.
Another interesting aspect of this plot is Iribe's first mention in r/UMD, on April 2, 2014. One might expect that this corresponds to the announcement of Brendan Iribe's donation to build the Iribe Center; however, that did not take place until September 11, 2014. Strangely enough, it appears that Iribe's donation was completely ignored by r/UMD when it was first announced, as there were no mentions of Iribe in September of 2014.
Let's dig a bit further using our analyze_word function on "Iribe" to find out who made the first post to mention the name:
print('First user to mention \"Iribe\": u/' + analyze_word('Iribe')['first_user'])
Let's now take a look at the first post by the user named "SyntheticBiology" and see if this is also the first mention of Iribe in r/UMD:
synth_bio = analyze_user('SyntheticBiology')
print('u/SyntheticBiology\'s first post on r/UMD was \"' + synth_bio['first_post_title'] + '.\"')
print('Its URL is', synth_bio['first_post_url'] + '.')
print('It received',synth_bio['first_post_karma'],'upvotes.')
Well, it's our lucky day! We've found the first post to mention Iribe in r/UMD. While the URL that was posted appears to be a dead link, the post's title is enough to tell us that the post was about Iribe and Antonov coming to give a talk on April 4, 2014. April 4, 2014 was also the start of the first Bitcamp, the University of Maryland's largest hackathon, which kicked off with a keynote speech by Iribe and Antonov, so this post was most likely referring to that speech.
As described in this Washington Post article:
On a recent visit to U-Md. — where Iribe first met his business partner, Michael Antonov, in a freshman dorm in 1998 — the 35-year-old Californian attended a school-sponsored “hackathon,” in which students use technology to solve a problem in a short amount of time. He met with professors and spoke to hundreds of students, impressed with their energy. But walking into the computer science center on campus, he said he found the facility “depressing” and “a lot worse than I remembered it.” ("Brendan Iribe, co-founder of Oculus VR, makes record $31 million donation to U-Md.", by Nick Anderson)
This article states that Iribe was inspired to make the massive donation required to build the Iribe Center during a disappointing stop in one of the computer science buildings (likely A.V. Williams) while visiting UMD to speak at a hackathon. The hackathon referred to by the article had to be Bitcamp 2014. Because of this immensely important visit, the University of Maryland today has the massive, state-of-the-art building in which our CMSC320 lectures take place.
It's funny to think that the Iribe Center's very existence started with a simple campus visit in April 2014 that was almost entirely ignored (receiving a meager three upvotes) when it was first posted about to r/UMD. Not even u/SyntheticBiology could have predicted the far-reaching effects that visit would have at the University of Maryland in the years to come.
Finally, let's take a look at the trend in sentiment for mentions of "Iribe."
sentiment_time_plot('Iribe').show()
While the general sentiment surrounding Iribe seems to be overwhelmingly positive, there is a noticeable waver in the LOWESS curve beginning in mid-2018. This may be a result of controversies and complaints surrounding the new building, such as noise complaints, problems with the doors, complaints about roof access, and complaints about the building's usage, mixed in with a variety of more positive posts and comments.
place = 1
# Sort by the number of upvotes
for index, row in df_post.sort_values(by='score', ascending=False).head(10).iterrows():
    print(str(place) + '.')
    print('Post Title: \"' + row['title'] + '\"')
    print('User: u/' + row['name'])
    print('Score:', row['score'])
    print('URL:', row['url'])
    print('Date Posted:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(row['created_utc'])))
    print(header_str, header_str)
    place += 1
The processes for finding the most upvoted comments and the most downvoted comments are identical, so we'll just define a single function that can do both, depending on the boolean that is passed into it.
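The trick here is that pandas' `sort_values` accepts a boolean for its `ascending` argument, so the flag can be passed straight through: `False` puts the highest scores first, `True` the lowest. A minimal sketch on toy data (the column names mirror `df_comment`, but the rows are hypothetical):

```python
import pandas as pd

# Toy stand-in for df_comment; only the columns the sort needs
df = pd.DataFrame({'body': ['a', 'b', 'c'], 'score': [5, -2, 12]})

def top_comments(df, worst):
    # ascending=worst flips the sort: False -> highest scores first,
    # True -> lowest (most downvoted) scores first
    return df.sort_values(by='score', ascending=worst)['body'].tolist()

print(top_comments(df, False))  # most upvoted first
print(top_comments(df, True))   # most downvoted first
```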
# If worst is True, find the most downvoted comments; otherwise find the most upvoted comments
def best_worst_comments(worst):
    if not worst:
        print('Top ten most upvoted comments in r/UMD:')
    else:
        print('Top ten most downvoted comments in r/UMD:')
    place = 1
    # Sort by score; ascending order surfaces the most downvoted comments first
    for index, row in df_comment.sort_values(by='score', ascending=worst).head(10).iterrows():
        # We don't have URLs that link directly to the comments in this data, so we'll find the post.
        parent_id = row['parent_id']
        parent_url = 'Not Available'
        # If the comment is a reply to another comment, we'll need to walk back up the chain until the
        # parent is a post. Parent IDs carry a 'tX_' prefix, so strip it before comparing against 'id'.
        while parent_id[3:] in df_comment['id'].values:
            # Find the index of the parent comment and use that to get the next parent
            for i, item in df_comment['id'].iteritems():
                if item == parent_id[3:]:
                    parent_id = df_comment.iat[i, 4]
        # The parent_id must refer to a post by the end of the while loop, so we'll get the URL from df_post
        for i, item in df_post['id'].iteritems():
            # Check for a substring match to ignore the 't3_' prefix on the parent ID
            if item in parent_id:
                parent_url = df_post.iat[i, 2]
        print(str(place) + '.')
        print('Comment: \"' + row['body'] + '\"')
        print('User: u/' + row['name'])
        print('Score:', row['score'])
        print('Date Posted:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(row['created_utc'])))
        print('Parent Post:', parent_url)
        print(header_str, header_str)
        place += 1
# Call our function to get the most upvoted comments in r/UMD
best_worst_comments(False)
# Call our function to get the most downvoted comments in r/UMD
best_worst_comments(True)
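The parent-chain walk inside best_worst_comments relies on Reddit's "fullname" convention, in which every ID carries a three-character type prefix: 't1_' for comments and 't3_' for posts. A minimal sketch of the same walk, using hypothetical IDs and plain dictionaries in place of the DataFrames:

```python
# Hypothetical data: comment id -> parent fullname, post id -> URL
comments = {'ccc': 't1_bbb', 'bbb': 't3_aaa'}
posts = {'aaa': 'https://example.com/post'}

def find_parent_post(parent_id):
    # Strip the 3-character type prefix and keep climbing while the parent is a comment
    while parent_id[3:] in comments:
        parent_id = comments[parent_id[3:]]
    # Once the loop exits, the parent must be a post ('t3_'), so look up its URL
    return posts.get(parent_id[3:], 'Not Available')

print(find_parent_post('t1_ccc'))  # a deeply nested reply resolves to its post's URL
```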
Finally, let's go back to the genesis of r/UMD, and see what was going on in 2010.
place = 1
# Sort by the date, earliest first
for index, row in df_post.sort_values(by='created_utc', ascending=True).head(10).iterrows():
    print(str(place) + '.')
    print('Post Title: \"' + row['title'] + '\"')
    print('User: u/' + row['name'])
    print('Score:', row['score'])
    print('URL:', row['url'])
    print('Date Posted:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(row['created_utc'])))
    print('~~~~~~~~~~ ~~~~~~~~~~')
    place += 1
It is interesting to note that while r/UMD itself was created on April 15, 2010, it appears that the first post was not made until June 25, 2010.
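That gap is easy to confirm from the timestamps themselves: the smallest `created_utc` value in `df_post` marks the subreddit's first post. A sketch with hypothetical Unix timestamps standing in for the real column:

```python
import datetime as dt

# Hypothetical stand-ins for df_post['created_utc']; the real values come from Pushshift
created_utc = [1277500000, 1278000000, 1279000000]

# The earliest timestamp is the first post; render it as a UTC date
first_post = dt.datetime.fromtimestamp(min(created_utc), dt.timezone.utc)
print(first_post.strftime('%Y-%m-%d'))  # → 2010-06-25
```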